Abstract

We consider a setting of Slepian-Wolf coding, where the random bin of the source vector undergoes
channel coding, and is then decoded at the receiver, based on additional side information, correlated
to the source. For a given distribution of the randomly selected channel codewords, we propose a 
universal decoder that depends on the statistics of neither the correlated sources nor the channel, 
assuming first that they are both memoryless. Exact analysis of the random-binning/random-coding
error exponent of this universal decoder shows that it is the same as the one achieved
by the optimal maximum a posteriori (MAP) decoder. Previously known results on universal 
Slepian-Wolf source decoding, universal channel decoding, and universal source-channel decoding, 
are all obtained as special cases of this result. Subsequently, we further generalize the results in 
several directions, including: (i) finite-state sources and finite-state channels, along with a universal 
decoding metric that is based on Lempel-Ziv parsing, (ii) arbitrary sources and channels, where the 
universal decoding is with respect to a given class of decoding metrics, and (iii) full (symmetric) 
Slepian-Wolf coding, where both source streams are separately fed into random-binning source 
encoders, followed by random channel encoders, which are then jointly decoded by a universal 
decoder.

1 Introduction


Universal decoding for unknown channels is a topic that attracted considerable attention throughout 
the last four decades. In [10], Goppa was the first to offer the maximum mutual information (MMI) 
decoder, which decodes the message as the one whose codeword has the largest empirical mutual 
information with the channel output sequence. Goppa proved that for discrete memoryless channels 
(DMC’s), MMI decoding attains capacity. Csiszar and Korner [3, Theorem 5.2] have further shown
that the random coding error exponent of the MMI decoder, pertaining to the ensemble of the 
uniform random coding distribution over a certain type class, achieves the same random coding 
error exponent as the optimum, maximum likelihood (ML) decoder. Ever since these early works 
on universal channel decoding, a considerably large volume of research work has been done, see, 
e.g., 0, 0,0, psi, m, m, nsi, m, for a non-exhaustive list of works on memoryless channels,
as well as more general classes of channels. 

At the same time, considering the analogy between channel coding and Slepian-Wolf (SW) source 
coding, it is not surprising that universal schemes for SW decoding, like the minimum entropy (ME) 
decoder, have also been derived, first, by Csiszar and Korner [3, Exercise 3.1.6], and later further
developed by others in various directions, see, e.g., pq , m , m, m , m, m- 

Much less attention, however, has been devoted to universal decoding for joint source-channel 
coding, where both the source and the channel are unknown to the decoder. Csiszar [2] was the
first to propose such a universal decoder, which he referred to as the generalized MMI decoder. The
generalized MMI decoding metric, to be maximized among all messages, is essentially (see footnote 1) given by the
difference between the empirical input-output mutual information of the channel and the empirical 
entropy of the source. In a way, it naturally combines the concepts of MMI channel decoding and 
ME source decoding. But the emphasis in [2] was inclined much more towards upper and lower 
bounds on the reliability function, whereas the universality of the decoder was quite a secondary 
issue. Consequently, later articles that refer to [2] also focus, first and foremost, on the joint source- 
channel reliability function and not really on universal decoding. We are not aware of subsequent 
works on universal source-channel decoding other than m which concerns a completely different 
setting, of zero-delay coding. 

In this work, we consider universal joint source-channel decoding in several settings that are all 
more general than that of [2]. In particular, we begin by considering the communication system
depicted in Fig. 1, which is described as follows: A source vector u, emerging from a discrete
memoryless source (DMS), undergoes Slepian-Wolf encoding (random binning) at rate R, followed 
by channel coding (random coding). The discrete memoryless channel (DMC) output y is fed into 
the decoder, along with a side information (SI) vector v, correlated to the source u, and the output 
of the decoder, û, should agree with u with probability as high as possible.

1 Strictly speaking, Csiszar’s decoding metric is slightly different, but is asymptotically equivalent to this definition.

Our first step is to characterize the exact exponential rate of the expected error probability,
associated with the optimum MAP decoder, where the expectation is over both ensembles of the
random binning encoder and the random channel code. We refer to this exponential rate as the
random-binning/random-coding error exponent. The second step, which is the more important 
one for our purposes, is to show that this error exponent is also achieved by a universal decoder, 
that depends neither on the statistics of the source, nor of the channel, and which is similar to 
Csiszar’s generalized MMI decoder. Beyond the fact that this model is more general than the one in
[2] (in the sense of including the random binning component as well as decoder SI), the assertion 
of the universal optimality of the generalized MMI decoder is stronger here than in [2]. In [2] the 
performance of the generalized MMI decoder is compared directly to an upper bound on the joint 
source-channel reliability function, and the claim on the optimality of this decoder is asserted only 
in the range where this bound is tight. Here, on the other hand (similarly as in earlier works on 
universal pure channel decoding), we argue that the generalized MMI decoder is always asymptotically
as good as the optimal MAP decoder, in the error exponent sense, no matter whether or not
there is a gap between the achievable exponent and the upper bound on the reliability function. 
In other words, like in previous works on universal decoding, the focus is on asymptotic optimality 
of the decoder for the average code and for an unknown channel, rather than on optimality of the 
overall communication system. However, as we shall see later on, since full optimization of the 
random coding ensemble is infeasible, due to channel uncertainty, the best one can hope for is the 
MAP source-channel error exponent due to Gallager [9, Problem 5.16]. We also provide an upper
bound to the error exponent for any communication system with the configuration of Fig. 1, and
discuss the conditions under which it is met. 



Figure 1: Slepian-Wolf source coding, followed by channel coding. The source u is source-channel encoded,
whereas the correlated SI v (described as being generated by a DMC fed by u) is available at the
decoder.

One motivation for studying this model is that it captures, in a unified framework, several
important special cases of communication systems, from the perspective of universal decoding.

1. Separate source coding and channel coding without SI: letting v be degenerate. 

2. Pure SW source coding: letting the channel be clean (y = x) and assuming the channel 
alphabet to be very large (so that the probability for two or more identical codewords would 
be negligible). 

3. Pure channel coding: letting the source be binary and symmetric, and the SI be degenerate. 

4. Joint source-channel coding with and without SI: letting the binning rate R be sufficiently 
large, so that the probability of ambiguous binning (i.e., when two or more source vectors are
mapped into the same bin) is negligible. In this case, the mapping between source vectors 
and channel input vectors is one-to-one with high probability, and therefore, this is a joint 
source-channel code. More details on this aspect will follow in the sequel. 

5. Systematic coding: letting the SI channel (from u to v) in Fig. 1 be identical to the main
channel (from x to y), and then the SI channel may represent transmission of the systematic 
(uncoded) part of the code (see discussions on this point of view also in [20] and [30]). 

Another motivation is that it serves as the basis for the more important part of the paper, where 
we provide three further extensions of this communication system model. In at least two of these 
more general situations, the analysis is more tricky and several difficulties that are encountered 
need to be handled with care. The extended scenarios are the following. 

1. Extending the scope from memoryless sources and channels to finite-state sources and finite- 
state channels. Here, the universal joint source-channel decoding metric is based on Lempel- 
Ziv (LZ) parsing, with the inspiration of [33]. The non-trivial parts of the analysis (not 
encountered in [33] or other related works) are mainly those described in items 1, 7 and 8 of 
Subsection 5.1. 

2. Further extending the scope to arbitrary sources and channels, but allowing a given, limited 
class of reference decoding metrics. We propose a universal joint source-channel decoder with 
the property that, no matter what the underlying source and channel may be, this universal 
decoder is asymptotically as good as the best decoder in the class for this source and channel.
This extends the recent study in [25], from pure channel coding to joint source-channel coding.

3. Generalizing the model to separate encodings (source binning followed by channel coding)
and joint decoding of two correlated sources (see Fig. 2 in Section 5.3). Here the universal
decoder must handle several types of error events due to possible ambiguities in the binning 
encoder. As a consequence of this fact, the proposed universal decoding metric for this 
scenario is surprisingly different from what one may expect. 

Finally, a few words are in order concerning the error exponent analysis. The ensemble of codes 
in our setting combines random binning (for the source coding part) and random coding (for the 
channel coding part), which is considerably more involved than ordinary error exponent analyses
that are associated with either one but not both. This requires a rather careful analysis, in two steps,
where in the first, we take the average probability of error over the ensemble of random binning
codes, for a given channel code, and at the second step, we average over the ensemble of channel 
codes. The latter employs the type class enumeration method m Chap. 6], which has already 
proved rather useful as a tool for obtaining exponentially tight random coding bounds in various 
contexts (see, e.g., [22], [23], [24] for a sample), and this work is no exception in that respect, as
the resulting error exponents are tight for the average code. 

The remaining part of the paper is organized as follows. In Section 2, we establish notation 
conventions, formalize the model and the problem, and finally, review some preliminaries. Section 
3 provides the main result along with some discussion. The proof of this result appears in Section 
4, and finally, Section 5 is devoted to the various extensions described above.

2 Notation Conventions, Problem Setting and Preliminaries 

2.1 Notation Conventions 

Throughout the paper, random variables will be denoted by capital letters, specific values they may
take will be denoted by the corresponding lower case letters, and their alphabets will be denoted by
calligraphic letters. Random vectors and their realizations will be denoted, respectively, by capital
letters and the corresponding lower case letters, both in the bold face font. Their alphabets will
be superscripted by their dimensions. For example, the random vector $\mathbf{X} = (X_1, \ldots, X_n)$
($n$ a positive integer) may take a specific vector value $\mathbf{x} = (x_1, \ldots, x_n)$ in
$\mathcal{X}^n$, the $n$-th order Cartesian power of $\mathcal{X}$, which is the alphabet of each
component of this vector. Sources and channels will be denoted by the letter $P$, $Q$, or $W$,
subscripted by the names of the relevant random variables/vectors and their conditionings, if
applicable, following the standard notation conventions, e.g., $Q_X$, $P_{Y|X}$, and so on. When
there is no room for ambiguity, these subscripts will be omitted. To avoid cumbersome notation, the
various probability distributions will be denoted as above, no matter whether probabilities of single
symbols or $n$-vectors are addressed. Thus, for example, $P_U(u)$ (or $P(u)$) will denote the
probability of a single symbol $u \in \mathcal{U}$, whereas $P_{\mathbf{U}}(\mathbf{u})$ (or
$P(\mathbf{u})$) will stand for the probability of the $n$-vector $\mathbf{u} \in \mathcal{U}^n$. The
probability of an event $\mathcal{E}$ will be denoted by $\Pr\{\mathcal{E}\}$, and the expectation
operator with respect to (w.r.t.) a probability distribution $P$ will be denoted by $E\{\cdot\}$. The
entropy of a generic distribution $Q$ on $\mathcal{X}$ will be denoted by $\mathcal{H}(Q)$. For two
positive sequences $a_n$ and $b_n$, the notation $a_n \doteq b_n$ will stand for equality in the
exponential scale, that is, $\lim_{n\to\infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0$. Accordingly, the
notation $a_n \doteq 2^{-n\infty}$ means that $a_n$ decays at a super-exponential rate (e.g.,
double-exponentially). Unless specified otherwise, logarithms and exponents, throughout this paper,
should be understood to be taken to the base 2. The indicator function of an event $\mathcal{E}$ will
be denoted by $\mathcal{I}\{\mathcal{E}\}$. The notation $[x]_+$ will stand for $\max\{0, x\}$. The
minimum between two reals, $a$ and $b$, will frequently be denoted by $a \wedge b$. The cardinality
of a finite set, say $\mathcal{A}$, will be denoted by $|\mathcal{A}|$.

The empirical distribution of a sequence $\mathbf{x} \in \mathcal{X}^n$, which will be denoted by
$\hat{P}_{\mathbf{x}}$, is the vector of relative frequencies $\hat{P}_{\mathbf{x}}(x)$ of each symbol
$x \in \mathcal{X}$ in $\mathbf{x}$. The type class of $\mathbf{x} \in \mathcal{X}^n$, denoted
$T(\mathbf{x})$, is the set of all vectors $\mathbf{x}'$ with $\hat{P}_{\mathbf{x}'} = \hat{P}_{\mathbf{x}}$.
When we wish to emphasize the dependence of the type class on the empirical distribution, say $Q$,
we will denote it by $T(Q)$. Information measures associated with empirical distributions will be
denoted with ‘hats’ and will be subscripted by the sequences from which they are induced. For
example, the entropy associated with $\hat{P}_{\mathbf{x}}$, which is the empirical entropy of
$\mathbf{x}$, will be denoted (see footnote 2) by $\hat{H}_{\mathbf{x}}(X)$. Again, the subscript will
be omitted whenever it is clear from the context what sequence the empirical distribution was
extracted from. Similar conventions will apply to the joint empirical distribution, the joint type
class, the conditional empirical distributions and the conditional type classes associated with pairs
of sequences of length $n$. Accordingly, $\hat{P}_{\mathbf{x}\mathbf{y}}$ would be the joint empirical
distribution of $(\mathbf{x},\mathbf{y}) = \{(x_i, y_i)\}_{i=1}^n$, $T(\mathbf{x},\mathbf{y})$ or
$T(\hat{P}_{\mathbf{x}\mathbf{y}})$ will denote the joint type class of $(\mathbf{x},\mathbf{y})$,
$T(\mathbf{x}|\mathbf{y})$ will stand for the conditional type class of $\mathbf{x}$ given
$\mathbf{y}$, $\hat{H}_{\mathbf{x}\mathbf{y}}(X,Y)$ will designate the empirical joint entropy of
$\mathbf{x}$ and $\mathbf{y}$, $\hat{H}_{\mathbf{x}\mathbf{y}}(X|Y)$ will be the empirical conditional
entropy, $\hat{I}_{\mathbf{x}\mathbf{y}}(X;Y)$ will denote empirical mutual information, and so on.
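The empirical quantities defined above can be computed directly from symbol counts. The following is a minimal Python sketch, not part of the original paper; the function names are illustrative assumptions, sequences are assumed to be equal-length lists or tuples of hashable symbols, and logarithms are taken to the base 2 as in the text.

```python
from collections import Counter
from math import log2


def empirical_entropy(x):
    """Empirical entropy (in bits) of the sequence x, computed from its type."""
    n = len(x)
    return -sum((c / n) * log2(c / n) for c in Counter(x).values())


def empirical_joint_entropy(x, y):
    """Empirical joint entropy of the pair sequence {(x_i, y_i)}."""
    return empirical_entropy(list(zip(x, y)))


def empirical_conditional_entropy(x, y):
    """Empirical conditional entropy of x given y: H(X,Y) - H(Y)."""
    return empirical_joint_entropy(x, y) - empirical_entropy(y)


def empirical_mutual_information(x, y):
    """Empirical mutual information: H(X) + H(Y) - H(X,Y)."""
    return (empirical_entropy(x) + empirical_entropy(y)
            - empirical_joint_entropy(x, y))


def same_type_class(x, x_prime):
    """True if x and x_prime have the same empirical distribution."""
    return Counter(x) == Counter(x_prime)
```

For instance, two sequences that are permutations of each other belong to the same type class and therefore have the same empirical entropy.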

2.2 Problem Setting for the Basic Setting 

Let $(\mathbf{U},\mathbf{V}) = \{(U_t,V_t)\}_{t=1}^n$ be $n$ independent copies of a pair of random
variables, $(U,V) \sim P_{UV}$, taking on values in finite alphabets, $\mathcal{U}$ and $\mathcal{V}$,
respectively. The vector $\mathbf{U}$ will designate the source vector to be encoded, whereas the
vector $\mathbf{V}$ will serve as correlated SI, available to the decoder. Let $W$ designate a DMC,
with single-letter, input-output transition probabilities $W(y|x)$, $x \in \mathcal{X}$,
$y \in \mathcal{Y}$, $\mathcal{X}$ and $\mathcal{Y}$ being finite input and output alphabets,
respectively. When the channel is fed by an input vector $\mathbf{x} \in \mathcal{X}^n$, it produces a
channel output vector $\mathbf{y} \in \mathcal{Y}^n$ (see footnote 3), according to

$$W(\mathbf{y}|\mathbf{x}) = \prod_{t=1}^n W(y_t|x_t). \qquad (1)$$

Consider the communication system depicted in Fig. 1. When a given realization
$\mathbf{u} = (u_1, \ldots, u_n)$ of the source vector $\mathbf{U}$ is fed into the system, it is
encoded into one out of $M = 2^{nR}$ bins, selected independently at random for every member of
$\mathcal{U}^n$. Here, $R > 0$ is referred to as the binning rate. The bin index $j = f(\mathbf{u})$ is
mapped into a channel input vector $\mathbf{x}(j) \in \mathcal{X}^n$, which in turn is transmitted
across the channel $W$. The various codewords $\{\mathbf{x}(j)\}_{j=1}^M$ are selected independently
at random under the uniform distribution across a given type class $T(Q)$, $Q$ being a given
probability distribution over $\mathcal{X}$ (see footnote 4). The randomly chosen codebook
$\{\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(M)\}$ will be denoted by $\mathcal{C}$. Both the
channel encoder, $\mathcal{C}$, and the realization of the random binning source encoder, $f$, are
revealed to the decoder as well. With a slight abuse of notation, we will sometimes denote
$\mathbf{x}(j) = \mathbf{x}[f(\mathbf{u})]$ by $\mathbf{x}[\mathbf{u}]$. The optimal (MAP) decoder
estimates $\mathbf{u}$, using the channel output $\mathbf{y} = (y_1, \ldots, y_n)$ and the SI vector
$\mathbf{v} = (v_1, \ldots, v_n)$, according to:

$$\hat{\mathbf{u}} = \arg\max_{\mathbf{u}} P(\mathbf{u},\mathbf{v}) W(\mathbf{y}|\mathbf{x}[\mathbf{u}]). \qquad (2)$$

2 Note that here we use the letter $H$ in the ordinary font, as opposed to the earlier defined notation of the entropy
as a functional of a distribution, where we used the calligraphic $\mathcal{H}$.

3 Without essential loss of generality, and similarly as in [2], we assume that the source and the channel operate at the
same rate, so that while the source emits the $n$-vector $(\mathbf{U}, \mathbf{V})$, the channel is used $n$ times exactly, transforming
$\mathbf{x} \in \mathcal{X}^n$ to $\mathbf{y} \in \mathcal{Y}^n$. The extension to the case where operation rates are different (bandwidth expansion factor
different from 1) is straightforward but is avoided here, in the quest of keeping notation and expressions less
cumbersome.

4 Rather than the same type $T(Q)$ for all bins, a more general ensemble may allow different types of codewords to
bins of different types of source vectors. We will address this point in the next section.

The average probability of error, $P_e$, is the probability of the event
$\{\hat{\mathbf{U}} \neq \mathbf{U}\}$, where in addition to the randomness of
$(\mathbf{U}, \mathbf{V})$ and the channel output $\mathbf{Y}$, the randomness of the source binning
code and the channel code are also taken into account. The random-binning/random-coding error
exponent, associated with the optimal, MAP decoder, is defined as

$$E(R,Q) = \lim_{n\to\infty} \left[-\frac{\log P_e}{n}\right], \qquad (3)$$

provided that the limit exists (a fact that will become evident from the analysis in the sequel).

The first step is to derive a single-letter expression for the exact random-binning/random-coding
error exponent $E(R,Q)$. While the MAP decoder depends on the source $P$ and the channel $W$,
the second step is to propose a universal decoder, independent of $P$ and $W$, whose average error
probability decays exponentially at the same rate, $E(R,Q)$. Note that we are considering a fixed
$Q$ without attempt to maximize $E(R,Q)$ w.r.t. $Q$ since the maximizing $Q$ normally depends on
the unknown channel $W$ (more on this in the next subsection). Finally, our main goal is to extend
the scope beyond memoryless systems, as well as to the setting where the role of $\mathbf{V}$ is no
longer merely to serve as SI at the decoder, but rather as another source vector, encoded similarly,
but separately from $\mathbf{U}$ (see Fig. 2).

2.3 Preliminaries - the Joint Source-Channel Error Exponent 


To the best of our knowledge, the first to consider error exponents for joint source-channel coding 
(without SI) was Gallager (see also Jelinek [11]). In the second part of Problem 5.16 in his textbook
[9] (pp. 534-535), the reader is requested to prove that for a given DMS P, a given DMC W, and
a given product distribution Q for random selection of a channel input vector x[u] for each source 
vector u, the average probability of error is upper bounded by

$$P_e \le \exp\left\{-n \max_{0 \le \rho \le 1}\left[E_0(\rho,Q) - (1+\rho)\ln\left(\sum_{u \in \mathcal{U}} [P(u)]^{1/(1+\rho)}\right)\right]\right\}, \qquad (4)$$

where $E_0(\rho,Q)$ is the well-known Gallager function

$$E_0(\rho,Q) = -\ln\left(\sum_{y \in \mathcal{Y}}\left[\sum_{x \in \mathcal{X}} Q(x) W(y|x)^{1/(1+\rho)}\right]^{1+\rho}\right). \qquad (5)$$

It is easy to show (see Appendix) that this exponential upper bound is equivalent to

$$P_e \le \exp\left\{-n \min_{\mathcal{H}(P) \le R \le \log|\mathcal{U}|}\left[E^s(R) + E^c_r(R,Q)\right]\right\}, \qquad (6)$$

where $E^s(R)$ is the source reliability function [16], given by

$$E^s(R) = \min_{\{P':\ \mathcal{H}(P') \ge R\}} D(P'\|P), \qquad (7)$$

$D(P'\|P)$ being the Kullback-Leibler divergence between $P'$ and $P$, and

$$E^c_r(R,Q) = \max_{0 \le \rho \le 1}\left[E_0(\rho,Q) - \rho R\right]. \qquad (8)$$
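As a numerical illustration of (5) and (8), the sketch below evaluates the Gallager function and the random coding exponent for a finite-alphabet channel. It is not taken from the paper: the function names, the list-of-lists representation `W[x][y]`, and the grid search over ρ (in place of an exact maximization) are all assumptions made for illustration. Natural logarithms are used, matching the `ln` in (4) and (5), so rates are in nats.

```python
from math import log


def gallager_e0(rho, Q, W):
    """Gallager function E_0(rho, Q) in nats.

    Q[x] is the input distribution, W[x][y] the transition probability.
    """
    s = 1.0 / (1.0 + rho)
    total = 0.0
    for y in range(len(W[0])):
        inner = sum(Q[x] * W[x][y] ** s for x in range(len(Q)))
        total += inner ** (1.0 + rho)
    return -log(total)


def random_coding_exponent(R, Q, W, grid=1000):
    """Approximate max over rho in [0, 1] of E_0(rho, Q) - rho * R.

    R is in nats; the maximization is a plain grid search (an assumption
    of this sketch, chosen for simplicity rather than accuracy).
    """
    return max(gallager_e0(k / grid, Q, W) - (k / grid) * R
               for k in range(grid + 1))
```

For a binary symmetric channel with crossover probability 0.1 and uniform Q, one would call `random_coding_exponent(R, [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]])`; the value decreases with R and vanishes once R reaches the mutual information induced by Q.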


Slightly more than a decade later, Csiszar [2] derived upper and lower bounds on the reliability
function of lossless joint source-channel coding (again, without SI). Csiszar has shown, in that
paper, that the reliability function, $E^{jsc}$, of lossless joint source-channel coding is upper
bounded by

$$E^{jsc} \le \min_{\mathcal{H}(P) \le R \le \log|\mathcal{U}|}\left[E^s(R) + E^c(R)\right], \qquad (9)$$

where $E^c(R)$ is the channel reliability function, for which there is a closed-form expression available
only at rate zero (the zero-rate expurgated exponent) and at rates above the critical rate (the
sphere-packing exponent). The lower bound in [2] is given by

$$E^{jsc} \ge \min_{R}\left[E^s(R) + E^c_r(R)\right], \qquad (10)$$

where $E^c_r(R)$ is the random coding error exponent of the channel $W$ and where we have relaxed the
constraint on the range of $R$ since this unconstrained minimum is attained in that range anyway
(see [2, p. 323, one line before the Remark]). The upper and the lower bounds coincide (and hence
provide the exact reliability function) whenever the minimizing $R$ of the upper bound (9) exceeds
the critical rate of the channel $W$.

Note that the difference between the achievable exponents of Gallager and Csiszar is in the
channel error exponent terms. In the former, it is $E^c_r(R,Q)$, whereas in the latter it is improved
to $E^c_r(R) = \max_Q E^c_r(R,Q)$. The reason is that, while Gallager uses the same random coding
distribution $Q$ for all codewords $\{\mathbf{x}[\mathbf{u}]\}$, Csiszar partitions the source space
according to the various types $\{P'\}$ and maps each such type into a channel subcode whose rate is
essentially $R = \mathcal{H}(P')$ and for which the channel input type $Q$ is optimized according to
$\max_Q E^c_r(\mathcal{H}(P'), Q)$. The difference disappears, of course, if for the channel $W$, the
same $Q$ maximizes $E^c_r(R,Q)$ for every $R$. This happens, for example, for the modulo-additive
channel $Y = X \oplus N$ ($N$ being independent of $X$), where the uniform distribution $Q$ is optimal
independently of $R$.

An expression equivalent to (10) is given by

$$E^{jsc} \ge \min_{P'} \max_{Q} \min_{W'} \left\{D(P'\|P) + D(W'\|W|Q) + [I(X;Y') - H(U')]_+\right\}, \qquad (11)$$

where $U'$ is an auxiliary random variable drawn by a source $P'$ over $\mathcal{U}$ (hence
$H(U') = \mathcal{H}(P')$), $X$ is governed by $Q$, $Y'$ designates the output of an auxiliary channel
$W': \mathcal{X} \to \mathcal{Y}$ fed by $X$, and $D(W'\|W|Q)$ is the Kullback-Leibler divergence
between $W'$ and $W$, weighted by $Q$, that is

$$D(W'\|W|Q) = \sum_{x \in \mathcal{X}} Q(x) \sum_{y \in \mathcal{Y}} W'(y|x) \log\frac{W'(y|x)}{W(y|x)}. \qquad (12)$$

Here, the term $D(P'\|P)$ is parallel to the source coding exponent, $E^s(R)$, whereas the sum of the
other two terms can be referred to the channel coding exponent $E^c_r(R)$ (see [2]). This is true since
the minimization over $P'$ in (11) can be carried out in two steps, where in the first, one minimizes
over all $\{P'\}$ with a given entropy $\mathcal{H}(P') = R$ (thus giving rise to $E^s(R)$ according to
(7)), and then minimizes over $R$.
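The weighted divergence in (12) is straightforward to evaluate for finite alphabets. The following Python sketch is illustrative only: the function names and the list-of-lists channel representation are assumptions, logarithms are base 2, and W is assumed to be strictly positive wherever W' is positive (otherwise the divergence is infinite and this code would raise an error).

```python
from math import log2


def kl_divergence(P_prime, P):
    """D(P'||P) in bits for two distributions given as equal-length lists."""
    return sum(a * log2(a / b) for a, b in zip(P_prime, P) if a > 0)


def conditional_kl(W_prime, W, Q):
    """D(W'||W|Q) = sum_x Q(x) * D(W'(.|x) || W(.|x)), in bits.

    W_prime[x] and W[x] are the rows (output distributions) for input x.
    """
    return sum(Q[x] * kl_divergence(W_prime[x], W[x]) for x in range(len(Q)))
```

Together with a routine for mutual information, these two functions give all the terms inside the braces of (11) for any candidate pair (P', W').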

In a nutshell, the idea behind the converse part in [2] is that each type class P', of source
vectors, can be thought of as being mapped by the encoder into a separate channel subcode at
rate $R = \mathcal{H}(P')$, and then the probability of error is lower bounded by the contribution of the
worst subcode. This is to say that for the purpose of the lower bound, only decoding errors within
each subcode are counted, whereas errors caused by confusing two source vectors that belong to two
different subcodes, are ignored. An interesting point, in this context, is that whenever the upper
and the lower bound coincide (in the exponential scale), this means that confusions within the 
subcodes dominate the error probability, at least as far as error exponents are concerned, whereas 
errors of confusing codewords from different subcodes are essentially immaterial. We will witness 
the same phenomenon from a different perspective, in the sequel. 

For the achievability part of [2], Csiszar analyzes the performance of a universal decoder that is
asymptotically equivalent to the following:

$$\hat{\mathbf{u}} = \arg\max_{\mathbf{u}} \left[\hat{I}_{\mathbf{x}[\mathbf{u}]\mathbf{y}}(X;Y) - \hat{H}_{\mathbf{u}}(U)\right]. \qquad (13)$$

As mentioned earlier, Csiszar refers to his decoder as the generalized MMI decoder. An important
point to observe, however, is that in this universal setting, it makes sense to assume that the
encoder does not know the channel $W$ either and hence cannot match the optimal channel input
type $Q$ to every given source type rate $R = \mathcal{H}(P')$ as described above, because this optimal
type depends on the unknown channel $W$ (see [2, p. 323, Remark]). Csiszar suggests, in this case, to
select a fixed type $Q$ for all source types, in which case the achievable exponent becomes the same
as Gallager’s exponent. Throughout this paper, we shall adopt this suggestion, and for this reason,
we have defined our objective (in the previous section) to universally achieve $E(R,Q)$ for a given
$Q$ without attempt to optimize $Q$.

3 Results for the System of Fig. 1

Our problem setting and results are more general than those of [2] from the following aspects:
(i) we include side information V, (ii) we include a cascade of random binning encoder and a
channel encoder (separate source- and channel coding), and (iii) we compare the performance of
the universal decoder to that of the MAP decoder (2) and show that they always (i.e., even when
the random coding ensemble is not good enough to achieve the reliability function) have the same
error exponent, whereas Csiszar compares the performance of (13) to the upper bound (9) and thus
may conclude for asymptotic optimality of the decoder (together with the encoder) only when the exact
joint source-channel reliability function is known.

Concerning (ii), one may wonder what is the motivation for separate source- and channel coding,
because joint source-channel coding is always at least as good. The answer to this question is
twofold:

1. In some applications, system constraints dictate separate source- and channel coding, for 
example, when the two encodings are performed at different units/locations or when general 
engineering considerations (like modularity) dictate separation. 

2. The joint source-channel setting, for a fixed channel input type Q, can always be obtained 
as a special case, by choosing the binning rate R sufficiently high, since then the binning 
encoder is a one-to-one mapping with an overwhelmingly high probability and the channel 
code in cascade to the binning code is equivalent to a direct mapping between source vectors 
and channel input vectors. 

Our main result is given by the following theorem. 

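Theorem 1 Consider the problem setting defined in Subsection 2.2.

(a) The random-binning/random-coding error exponent of the MAP decoder is given by

$$E(R,Q) = \min_{P_{U'V'},\,W'} \left\{D(P_{U'V'}\|P_{UV}) + D(W'\|W|Q) + [R \wedge I(X;Y') - H(U'|V')]_+\right\}, \qquad (14)$$

where $(U',V') \in \mathcal{U} \times \mathcal{V}$ are auxiliary random variables jointly distributed
according to $P_{U'V'}$, and $Y' \in \mathcal{Y}$ is another auxiliary random variable that designates
the output of channel $W'$ when fed by $X \sim Q$.

(b) The universal decoders,

$$\hat{\mathbf{u}} = \arg\max_{\mathbf{u}} \left[\hat{I}_{\mathbf{x}[\mathbf{u}]\mathbf{y}}(X;Y) - \hat{H}_{\mathbf{u}\mathbf{v}}(U|V)\right] \qquad (15)$$

and

$$\hat{\mathbf{u}} = \arg\max_{\mathbf{u}} \left[R \wedge \hat{I}_{\mathbf{x}[\mathbf{u}]\mathbf{y}}(X;Y) - \hat{H}_{\mathbf{u}\mathbf{v}}(U|V)\right], \qquad (16)$$

both achieve $E(R,Q)$.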
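To make the decoding rules (15) and (16) concrete, here is a toy Python sketch of the metric computation. It is an illustration under simplifying assumptions, not the paper's construction: the search runs over an explicit, small list of candidate source vectors (rather than all of the source space), `codeword_of` is a hypothetical callable standing in for the composition of the binning function and the channel codebook, and all names are invented for this example.

```python
from collections import Counter
from math import log2


def entropy(seq):
    """Empirical entropy (bits) of a sequence of hashable symbols."""
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in Counter(seq).values())


def mutual_information(x, y):
    """Empirical mutual information between equal-length sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def conditional_entropy(u, v):
    """Empirical conditional entropy of u given v."""
    return entropy(list(zip(u, v))) - entropy(v)


def universal_decode(y, v, candidates, codeword_of, R=None):
    """Pick the candidate u maximizing metric (15), or (16) when R is given.

    y: channel output, v: side information, candidates: iterable of source
    vectors, codeword_of(u): channel input assigned to u (hypothetical helper).
    """
    def metric(u):
        i_hat = mutual_information(codeword_of(u), y)
        if R is not None:
            i_hat = min(R, i_hat)  # the R ∧ (.) operation of (16)
        return i_hat - conditional_entropy(u, v)

    return max(candidates, key=metric)
```

Note that the metric uses only empirical statistics of the observed sequences, so neither the source distribution nor the channel transition probabilities appear anywhere in the decoder.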



Decoder (15) is, of course, a natural extension of (13) to our setting. As for (16), while it offers
no apparent advantage over (15), it is given here as an alternative decoder for future reference. It
will turn out later that conceptually, (16) lends itself more naturally to the extension that deals
with separate encodings and joint decoding of two correlated sources, where in the extended version
of (16), it will not be obvious (at least not to the author) that the operator $R \wedge (\cdot)$ is
neutral (i.e., that an expression like $R \wedge x$ can be simply replaced by $x$, as is indeed
suggested here by the equivalence between (15) and (16)). Another interesting point concerning (16)
is that it appears more clearly as a joint extension of the MMI decoder of pure channel decoding and
the ME decoder of pure source coding. When $R$ dominates the term
$R \wedge \hat{I}_{\mathbf{x}[\mathbf{u}]\mathbf{y}}(X;Y)$, the source coding component of the
problem is more prominent and (16) is essentially equivalent to the ME decoder. Otherwise, it is
essentially equivalent to (15).

As can be seen, $E(R,Q)$ is monotonically non-decreasing in $R$ (see footnote 5), but when $R$ is
sufficiently large, the term $R \wedge I(X;Y')$ is dominated by $I(X;Y')$ (say, $R = \log|\mathcal{X}|$),
which yields saturation of $E(R,Q)$ to the level of the joint source-channel random coding exponent
(for a given $Q$), similarly as in (11), except that here, the entropy $H(U')$ is replaced by the
conditional entropy $H(U'|V')$, due to the SI. Obviously, if $V'$ is degenerate (e.g., equal to a fixed
$v \in \mathcal{V}$ with probability one), then we are back to (11). For another extreme case, if the
channel is clean, $Q$ is uniform, and the channel alphabet is very large, then
$I(X;Y') = H(X) = \log|\mathcal{X}|$ is large as well, and then $R \wedge I(X;Y')$ is dominated by $R$.
In this case, we recover the SW random binning error exponent (see, e.g., [52] and references therein).

Finally, although not directly related to the aspect of universality, for the sake of completeness
(and in analogy to [2]), we next provide also a converse bound that applies to any communication
system of the type depicted in Fig. 1, where both the binning code
$f: \mathcal{U}^n \to \{1, 2, \ldots, M\}$ and the channel code, that maps $\{1, 2, \ldots, M\}$ into
$\mathcal{C}$, are arbitrary (and deterministic), and where the optimal MAP decoder is used. It is not
difficult to extend Lemma 2 of [2] to the scenario under discussion and to argue that given the source
$P_{UV}$, the channel $W$, and the binning rate $R$, the highest achievable source-channel error
exponent, $E(P_{UV}, W, R)$, is upper bounded by

$$E(P_{UV}, W, R) \le \min_{P_{U'V'}} \left\{D(P_{U'V'}\|P_{UV}) + \mathcal{I}\{H(U'|V') \le R\} \cdot E^c[H(U'|V')]\right\}, \qquad (17)$$

which follows immediately from the simple consideration of viewing each conditional type
$T(\mathbf{u}'|\mathbf{v}')$ (whose weight is exponentially $2^{-nD(P_{U'V'}\|P_{UV})}$) as being
encoded by a channel subcode at rate $\tilde{R} = H(U'|V')$, which is the corresponding empirical
conditional entropy. Now, as long as $\tilde{R} \le R$, this conditional type may be mapped into the
channel code without loss of information and then the error probability within this subcode is lower
bounded by $2^{-n[E^c(\tilde{R}) + o(1)]}$ (as all source messages originating from
$T(\mathbf{u}'|\mathbf{v}')$ are equally likely given $\mathbf{v}'$). However, if $\tilde{R} > R$, most
of the members of $T(\mathbf{u}'|\mathbf{v}')$ are mapped ambiguously by the binning encoder and the
probability of error goes to unity even without the channel noise, hence the factor
$\mathcal{I}\{H(U'|V') \le R\}$ in the second term. If we further upper bound $E^c(\cdot)$ by the
sphere-packing exponent, then for the second term of the above we have

$$\mathcal{I}\{H(U'|V') \le R\} \cdot E^c_{sp}[H(U'|V')] = \max_{Q} \min_{\{W':\ R \wedge I(X;Y') \le H(U'|V')\}} D(W'\|W|Q). \qquad (18)$$

To see why this is true, one simply examines the two cases, $H(U'|V') \le R$ and $H(U'|V') > R$. In
the former case, the constraint $R \wedge I(X;Y') \le H(U'|V')$ is equivalent to
$I(X;Y') \le H(U'|V')$, and then both the right-hand side (r.h.s.) and the left-hand side (l.h.s.)
become $E^c_{sp}[H(U'|V')]$. In the latter case, the constraint is trivially met for every $W'$,
including the choice $W' = W$ for which the r.h.s. vanishes, exactly like the l.h.s. Putting this
together, we have

$$E(P_{UV}, W, R) \le \min_{P_{U'V'}} \max_{Q} \min_{\{W':\ R \wedge I(X;Y') \le H(U'|V')\}} \left[D(P_{U'V'}\|P_{UV}) + D(W'\|W|Q)\right]. \qquad (19)$$

5 This fact is not completely trivial, since an increase in $R$ improves the source binning part, but one may expect
that it harms the channel coding part. Nonetheless, as will become apparent in the sequel (see footnote 7), the
combined effect of source binning and channel coding gives a non-decreasing exponent as a function of $R$.

$E(R,Q)$ of Theorem 1 meets this upper bound whenever the minimizing $P_{U'V'}$ and $W'$ of $E(R,Q)$
are such that $R \wedge I(X;Y') \le H(U'|V')$ and the minimization over $P_{U'V'}$ can be interchanged
with the maximization over $Q$, e.g., when the optimal $Q$ is independent of the coding rate, as
discussed above.$^{6}$ The first condition can be stated differently as follows: if we formally define
$E_s(\tilde{R}) = \min\{D(P_{U'V'}\|P_{UV}) :\ H(U'|V') = \tilde{R}\}$ (which depends solely on the source) and
$E_c(\tilde{R}, R, Q) = \min_{W'}\{D(W'\|W|Q) + [R \wedge I(X;Y') - \tilde{R}]_+\}$ (which depends solely on the channel),
then the upper bound is attained if the value of $\tilde{R}$ that minimizes $[E_s(\tilde{R}) + E_c(\tilde{R}, R, Q)]$ is large
enough such that $E_c(\tilde{R}, R, Q)$ is achieved by $W'$ for which $R \wedge I(X;Y') \le \tilde{R}$.


4 Proof of Theorem 1 


The outline of the proof is as follows. We begin by showing that $E(R,Q)$ is an upper bound on the
error exponent associated with the MAP decoder, and then we show that both universal decoders,
(15) and (16), attain $E(R,Q)$. The combination of these two facts will prove both parts of Theorem
1 at the same time.

As a first step, let the channel codebook $\mathcal{C}$, as well as the vectors $u$, $v$, $x = x[u]$ and $y$ be given,
and let $P_e(u,v,x,y,\mathcal{C})$ be the average error probability given $(u,v,x,y,\mathcal{C})$, where the averaging
is w.r.t. the ensemble of random binning source codes. For a given $u' \ne u$, let us define the set

$$A(u,u',v,x,y) = T(Q) \cap \{x' :\ P(u',v)W(y|x') \ge P(u,v)W(y|x)\}. \tag{20}$$

The conditional error event, given $(u,v,x,y,\mathcal{C})$, is given by

$$\mathcal{E}(u,v,x,y,\mathcal{C}) = \bigcup_{u' \ne u} \{P(u',v)W(y|x[u']) \ge P(u,v)W(y|x[u])\} = \bigcup_{u' \ne u} \mathcal{E}(u,u',v,x,y,\mathcal{C}). \tag{21}$$

The probability of the pairwise error event, $\mathcal{E}(u,u',v,x,y,\mathcal{C})$ (again, w.r.t. the randomness of the
bin assignment), is given by:

$$\Pr\{\mathcal{E}(u,u',v,x,y,\mathcal{C})\} = 2^{-nR}\,\big|\mathcal{C} \cap A(u,u',v,x,y)\big| + 2^{-nR}\,\mathcal{I}\{P(u',v) \ge P(u,v)\}. \tag{22}$$


Here, the first term accounts for the probability to randomly choose a bin, other than $f(u)$, which
is mapped to a channel input vector whose likelihood score is larger than $P(u,v)W(y|x[u])$. The
second term is associated with the probability that $f(u') = f(u)$ (which is $2^{-nR}$), in which case
the factor $W(y|x[u']) = W(y|x[u])$ cancels out in the pairwise likelihood score comparison, and
so, $u'$ prevails if $P(u',v) \ge P(u,v)$. Now,


$$\begin{aligned}
P_e(u,v,x,y,\mathcal{C}) &= \Pr\{\mathcal{E}(u,v,x,y,\mathcal{C})\} \\
&\doteq \sum_{\{T(u'|v)\}} \Pr\Big\{ \bigcup_{u'' \in T(u'|v)} \mathcal{E}(u,u'',v,x,y,\mathcal{C}) \Big\} \\
&\doteq \sum_{\{T(u'|v)\}} \min\Big\{1,\ |T(u'|v)| \cdot 2^{-nR} \Big[ \big|\mathcal{C} \cap A(u,u',v,x,y)\big| + \mathcal{I}\{P(u',v) \ge P(u,v)\} \Big] \Big\},
\end{aligned} \tag{23}$$

6 Note that here, as opposed to [3], the independence of the optimal $Q$ upon $R$ is needed (in order to match the
converse) even when the channel is known, because the encoder is unaware of the virtual rate $\tilde{R} = H(U'|V')$ due
to the unavailability of $v$ at the encoder.


where we have used the fact that $A(u,u'',v,x,y) = A(u,u',v,x,y)$ and $P(u'',v) = P(u',v)$
for $u'' \in T(u'|v)$, and where the exponential tightness of the truncated union bound (for pairwise
independent events) in the last expression is known from [31, Lemma A.2, p. 109] and can also
be readily deduced from de Caen's lower bound on the probability of a union of events [5]. The
next step is to average over the randomness of $\mathcal{C}$ (except the codeword for the bin of the actual
source vector $u$, which is still given to be $x$):


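$$\begin{aligned}
\bar{P}_e(u,v,x,y) &= E_{\mathcal{C} \setminus \{x\}} \{ P_e(u,v,x,y,\mathcal{C}) \} \\
&\doteq \sum_{\{T(u'|v)\}} E\Big( \min\Big\{1,\ |T(u'|v)| \cdot 2^{-nR} \Big[ \big|\mathcal{C} \cap A(u,u',v,x,y)\big| + \mathcal{I}\{P(u',v) \ge P(u,v)\} \Big] \Big\} \Big),
\end{aligned} \tag{24}$$

where $E_{\mathcal{C} \setminus \{x\}}$ stands for expectation over the randomness of all codewords in $\mathcal{C}$ other than $x = x[u]$.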

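Now, using the identity $E\{Z\} = \int_0^\infty \Pr\{Z \ge t\}\,dt$, which is valid for any non-negative random
variable $Z$, we have

$$\begin{aligned}
& E\Big( \min\Big\{1,\ |T(u'|v)| \cdot 2^{-nR} \Big[ \big|\mathcal{C} \cap A(u,u',v,x,y)\big| + \mathcal{I}\{P(u',v) \ge P(u,v)\} \Big] \Big\} \Big) \\
&= \int_0^1 dt \cdot \Pr\Big\{ |T(u'|v)| \cdot 2^{-nR} \Big[ \big|\mathcal{C} \cap A(u,u',v,x,y)\big| + \mathcal{I}\{P(u',v) \ge P(u,v)\} \Big] \ge t \Big\} \\
&= \int_0^1 dt \cdot \Pr\Big\{ \mathcal{I}\{P(u',v) \ge P(u,v)\} + \sum_i \mathcal{I}[X(i) \in A(u,u',v,x,y)] \ge \frac{t \cdot 2^{nR}}{|T(u'|v)|} \Big\} \\
&\doteq n \ln 2 \cdot \int_0^\infty d\theta \cdot 2^{-n\theta}\, \Pr\Big\{ \mathcal{I}\{P(u',v) \ge P(u,v)\} + \sum_i \mathcal{I}[X(i) \in A(u,u',v,x,y)] \ge 2^{n[R - \theta - \hat{H}(U'|V)]} \Big\},
\end{aligned} \tag{25}$$

where in the last passage, we have used the shorthand notation $\hat{H}(U'|V)$ for $\hat{H}_{u'v}(U|V)$, and we
have changed the integration variable from $t$ to $\theta$, according to the relation $t = 2^{-n\theta}$.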

Consider first the case where $P(u',v) \ge P(u,v)$. Then, the integrand is given by

$$2^{-n\theta} \cdot \Pr\Big\{ 1 + \sum_i \mathcal{I}[X(i) \in A(u,u',v,x,y)] \ge 2^{n[R - \theta - \hat{H}(U'|V)]} \Big\}, \tag{26}$$

in which the second factor is obviously equal to unity for all $\theta \ge [R - \hat{H}(U'|V)]_+$. Thus, the tail of
the integral (25) is given by

$$n \ln 2 \cdot \int_{[R - \hat{H}(U'|V)]_+}^{\infty} d\theta \cdot 2^{-n\theta} = 2^{-n[R - \hat{H}(U'|V)]_+}. \tag{27}$$

For $\theta < [R - \hat{H}(U'|V)]_+$, the unity term in (26) can be safely neglected, and


$$\Pr\Big\{ \sum_i \mathcal{I}[X(i) \in A(u,u',v,x,y)] \ge 2^{n[R - \theta - \hat{H}(U'|V)]} \Big\} \tag{28}$$

is the probability of a large deviations event associated with a binomial random variable with $2^{nR}$
trials and probability of success of the exponential order of $2^{-nJ}$, with $J$ being defined as

$$J = \min\{ \hat{I}(X';Y) :\ \hat{P}_{x'y} \text{ is such that } x' \in T(Q) \cap A(u,u',v,x,y) \}, \tag{29}$$

where $\hat{I}(X';Y)$ is shorthand notation for $\hat{I}_{x'y}(X;Y)$ and where it should be kept in mind that $J$
depends on $\hat{P}_{uv}$, $\hat{P}_{u'v}$ and $\hat{P}_{xy}$. According to [2T, Chap. 6], the large deviations behavior is as
follows:


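$$\begin{aligned}
\Pr\Big\{ \sum_i \mathcal{I}[X(i) \in A(u,u',v,x,y)] \ge 2^{n[R - \theta - \hat{H}(U'|V)]} \Big\}
&\doteq \begin{cases} 2^{-n[J - R]_+} & R - \theta - \hat{H}(U'|V) \le [R - J]_+ \\ 2^{-n\infty} & R - \theta - \hat{H}(U'|V) > [R - J]_+ \end{cases} \\
&= \begin{cases} 2^{-n[J - R]_+} & \theta \ge [R - \hat{H}(U'|V) - [R - J]_+]_+ \\ 2^{-n\infty} & \text{elsewhere} \end{cases}
\end{aligned} \tag{30}$$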
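
The first line of (30) can be illustrated numerically in its simplest instance, where the threshold equals one, so that the event is "at least one success". The following is a minimal sketch, not part of the original derivation: the function names and the chosen values of $n$, $R$ and $J$ are assumptions made purely for illustration. It evaluates $1 - (1 - 2^{-nJ})^{2^{nR}}$ in a numerically stable way and compares the resulting exponent with $[J - R]_+$.

```python
import math

def prob_at_least_one(n, R, J):
    """Pr{N >= 1} for N ~ Binomial(2^{nR} trials, success probability 2^{-nJ})."""
    trials = 2.0 ** (n * R)
    p = 2.0 ** (-n * J)
    # 1 - (1 - p)^trials, written with log1p/expm1 to avoid cancellation
    return -math.expm1(trials * math.log1p(-p))

def empirical_exponent(n, R, J):
    """-(1/n) log2 Pr{N >= 1}; expected to approach [J - R]_+ as n grows."""
    return -math.log2(prob_at_least_one(n, R, J)) / n

def predicted_exponent(R, J):
    """The exponent [J - R]_+ appearing in (30)."""
    return max(J - R, 0.0)
```

For instance, with the illustrative values $n = 20$, $R = 0.3$, $J = 0.6$, the empirical exponent is within $0.01$ of $J - R = 0.3$, whereas for $R = 0.6$, $J = 0.3$ the probability is essentially one and the exponent vanishes.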


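Thus, the other contribution to (25) is given by

$$\begin{aligned}
& n \ln 2 \cdot \int_{[R - \hat{H}(U'|V) - [R - J]_+]_+}^{[R - \hat{H}(U'|V)]_+} d\theta \cdot 2^{-n\theta} \cdot 2^{-n[J - R]_+} \\
&\doteq \exp_2\{-n([R - \hat{H}(U'|V) - [R - J]_+]_+ + [J - R]_+)\} \\
&= \exp_2\{-n([R \wedge J - \hat{H}(U'|V)]_+ + [J - R]_+)\} \\
&= \exp_2\{-n([R \wedge J - \hat{H}(U'|V)]_+ - R \wedge J + R \wedge J + [J - R]_+)\} \\
&= \exp_2\{-n[-R \wedge J \wedge \hat{H}(U'|V) + J]\} \\
&= \exp_2\{-n[J - R \wedge \hat{H}(U'|V)]_+\},
\end{aligned} \tag{31}$$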


where we have repeatedly used the identity $a - [a - b]_+ = a \wedge b$. Thus, the total conditional error
exponent, for the case $P(u',v) \ge P(u,v)$, is given by

$$\begin{aligned}
\min\{[R - \hat{H}(U'|V)]_+,\ [J - R \wedge \hat{H}(U'|V)]_+\}
&= \min\{R - R \wedge \hat{H}(U'|V),\ J - R \wedge J \wedge \hat{H}(U'|V)\} \\
&= [R \wedge J - \hat{H}(U'|V)]_+,
\end{aligned} \tag{32}$$

where the last line follows from the following consideration.$^{7}$ If $\hat{H}(U'|V) \ge J$, then all three
expressions obviously vanish and then the equality is trivially met. Otherwise, $\hat{H}(U'|V) < J$
implies that the term $R \wedge \hat{H}(U'|V)$, in the second line, can safely be replaced by $R \wedge J \wedge \hat{H}(U'|V)$,
which makes the second line identical to $R \wedge J - R \wedge J \wedge \hat{H}(U'|V) = [R \wedge J - \hat{H}(U'|V)]_+$. In the
case $P(u',v) < P(u,v)$, the conditional error exponent is just $[J - R \wedge \hat{H}(U'|V)]_+$.
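
The algebra leading to (32) can be sanity-checked numerically. The sketch below is an illustration only (the helper names and the grid of test values are assumptions, not part of the original text): it evaluates the first and last lines of (32) on a grid of values of $R$, $J$ and the conditional entropy, and also checks the identity $a - [a - b]_+ = a \wedge b$.

```python
import itertools

def pos(a):
    """The operator [a]_+ = max(a, 0)."""
    return max(a, 0.0)

def first_line(R, J, H):
    """min{[R - H]_+, [J - min(R, H)]_+}, the first line of (32)."""
    return min(pos(R - H), pos(J - min(R, H)))

def last_line(R, J, H):
    """[min(R, J) - H]_+, the last line of (32)."""
    return pos(min(R, J) - H)

# Multiples of 1/4 are exactly representable, so comparisons are exact.
grid = [i / 4 for i in range(9)]
mismatches = [
    (R, J, H)
    for R, J, H in itertools.product(grid, repeat=3)
    if first_line(R, J, H) != last_line(R, J, H)
]
identity_ok = all(a - pos(a - b) == min(a, b) for a in grid for b in grid)
```

On this grid, `mismatches` comes out empty and `identity_ok` is true, in agreement with the case analysis given above.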

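Let $E_0(\hat{P}_{uv}, \hat{P}_{u'v}, \hat{P}_{xy})$ denote the overall conditional error exponent given $(u,u',v,x,y)$, i.e.,

$$E_0(\hat{P}_{uv}, \hat{P}_{u'v}, \hat{P}_{xy}) = \begin{cases} [R \wedge J - \hat{H}(U'|V)]_+ & P(u',v) \ge P(u,v) \\ [J - R \wedge \hat{H}(U'|V)]_+ & \text{otherwise} \end{cases} \tag{33}$$

Finally, by averaging the obtained exponential estimate of $\bar{P}_e(U,V,X,Y)$ over the randomness of
$(U,V,X,Y)$, and using the method of types in the standard manner, we obtain

$$E(R,Q) = \lim_{n \to \infty} \min_{\hat{P}_{uv}, \hat{P}_{xy}} \left[ D(\hat{P}_{uv}\|P_{UV}) + D(\hat{P}_{y|x}\|W|Q) + E_1(\hat{P}_{uv}, \hat{P}_{xy}) \right], \tag{34}$$

where $\hat{P}_x$ is constrained to coincide with $Q$ and

$$E_1(\hat{P}_{uv}, \hat{P}_{xy}) = \min_{\hat{P}_{u'v}} E_0(\hat{P}_{uv}, \hat{P}_{u'v}, \hat{P}_{xy}). \tag{35}$$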

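An obvious upper bound$^{8}$ is obtained by

$$E_1(\hat{P}_{uv}, \hat{P}_{xy}) \le E_0(\hat{P}_{uv}, \hat{P}_{uv}, \hat{P}_{xy}) \le [R \wedge \hat{I}(X;Y) - \hat{H}(U|V)]_+ \triangleq E_1^*(\hat{P}_{uv}, \hat{P}_{xy}), \tag{36}$$

where we have used the fact that $x \in A(u,u,v,x,y)$ and so, for $\hat{P}_{u'v} = \hat{P}_{uv}$, one has $J \le \hat{I}(X;Y)$.
Thus,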


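$$\begin{aligned}
E(R,Q) &\le \min_{P_{U'V'}, W'} \left[ D(P_{U'V'}\|P_{UV}) + D(W'\|W|Q) + E_1^*(P_{U'V'}, Q \times W') \right] \\
&= \min_{P_{U'V'}, W'} \left\{ D(P_{U'V'}\|P_{UV}) + D(W'\|W|Q) + [R \wedge I(X;Y') - H(U'|V')]_+ \right\} \\
&= E_U(R,Q).
\end{aligned} \tag{37}$$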

7 The first line of (32) corresponds to the worst between the source coding exponent, $[R - \hat{H}(U'|V)]_+$, and the
channel coding exponent, $[J - R \wedge \hat{H}(U'|V)]_+$, which is to be expected in separate source and channel coding.
While the former is non-decreasing in $R$, the latter is non-increasing. From the last line of (32) we learn that the
overall exponent is non-decreasing in $R$.

8 We are upper bounding the minimum of $E_0$ over $\{\hat{P}_{u'v}\}$ by the value of $E_0$ where $\hat{P}_{u'v} = \hat{P}_{uv}$, and will shortly
see that this bound is actually tight. This means that the error exponent is dominated by erroneous vectors $\{u'\}$
that are within the same conditional type (given $v$) as the correct source vector $u$. This is coherent with the
observation discussed in Subsection 2.3, that errors within the subcode pertaining to the same type class dominate
the error exponent.

We next argue that the universal decoder (15) achieves $E_U(R,Q)$ and hence $E(R,Q) = E_U(R,Q)$.
To see why this is true, one repeats exactly the same derivation, with the following two simple
modifications:

1. $A(u,u',v,x,y)$ is replaced by

$$A(u,u',v,x,y) = T(Q) \cap \{x' :\ \hat{I}(X';Y) - \hat{H}(U'|V) \ge \hat{I}(X;Y) - \hat{H}(U|V)\} \tag{38}$$

and accordingly, $J$ is replaced by

$$J = \min\{\hat{I}(X';Y) :\ \hat{I}(X';Y) - \hat{H}(U'|V) \ge \hat{I}(X;Y) - \hat{H}(U|V)\}. \tag{39}$$

2. The indicator function $\mathcal{I}\{P(u',v) \ge P(u,v)\}$ is replaced by $\mathcal{I}\{\hat{H}(U'|V) \le \hat{H}(U|V)\}$.
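The result is then similar except that $E_0(\hat{P}_{uv}, \hat{P}_{u'v}, \hat{P}_{xy})$ is replaced by

$$E_0(\hat{P}_{uv}, \hat{P}_{u'v}, \hat{P}_{xy}) = \begin{cases} [R \wedge J - \hat{H}(U'|V)]_+ & \hat{H}(U'|V) \le \hat{H}(U|V) \\ [J - R \wedge \hat{H}(U'|V)]_+ & \text{otherwise} \end{cases} \tag{40}$$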
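Now, observe that for the first line of (40),

$$\begin{aligned}
R \wedge J - \hat{H}(U'|V)
&\ge R \wedge [\hat{I}(X;Y) - \hat{H}(U|V) + \hat{H}(U'|V)] - \hat{H}(U'|V) \\
&= [R - \hat{H}(U'|V)] \wedge [\hat{I}(X;Y) - \hat{H}(U|V)] \\
&\ge [R - \hat{H}(U|V)] \wedge [\hat{I}(X;Y) - \hat{H}(U|V)] \qquad \text{since } \hat{H}(U'|V) \le \hat{H}(U|V) \\
&= R \wedge \hat{I}(X;Y) - \hat{H}(U|V),
\end{aligned} \tag{41}$$

where the first line follows from the definition of $J$. As for the second line,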


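$$\begin{aligned}
J - R \wedge \hat{H}(U'|V)
&\ge \hat{I}(X;Y) - \hat{H}(U|V) + \hat{H}(U'|V) - R \wedge \hat{H}(U'|V) \\
&= \hat{I}(X;Y) - \hat{H}(U|V) + [\hat{H}(U'|V) - R]_+ \\
&\ge \hat{I}(X;Y) - \hat{H}(U|V) + [\hat{H}(U|V) - R]_+ \qquad \text{since } \hat{H}(U'|V) > \hat{H}(U|V) \\
&= \hat{I}(X;Y) - R \wedge \hat{H}(U|V) \\
&\ge R \wedge \hat{I}(X;Y) - \hat{H}(U|V).
\end{aligned} \tag{42}$$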


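We conclude then that, no matter whether $\hat{H}(U'|V) \le \hat{H}(U|V)$ or $\hat{H}(U'|V) > \hat{H}(U|V)$, we always
have:

$$E_0(\hat{P}_{uv}, \hat{P}_{u'v}, \hat{P}_{xy}) \ge [R \wedge \hat{I}(X;Y) - \hat{H}(U|V)]_+ = E_1^*(\hat{P}_{uv}, \hat{P}_{xy}), \tag{43}$$

and so, the overall exponent $E_U(R,Q)$ is achieved by (15).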
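
To make the decoding metrics concrete, the following minimal sketch computes the empirical quantities that enter them for given sequences. It assumes, consistently with (38) and with the constraint used for the alternative metric, that the two universal metrics are $\hat{I}(X;Y) - \hat{H}(U|V)$ and $R \wedge \hat{I}(X;Y) - \hat{H}(U|V)$; the function names are hypothetical and the sequences may be any Python sequences of hashable symbols.

```python
from collections import Counter
from math import log2

def empirical_entropy(seq):
    """Empirical entropy (in bits) of the symbols in seq."""
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in Counter(seq).values())

def cond_entropy(a, b):
    """Empirical conditional entropy H^(A|B) = H^(A,B) - H^(B)."""
    return empirical_entropy(list(zip(a, b))) - empirical_entropy(list(b))

def mutual_information(a, b):
    """Empirical mutual information I^(A;B) = H^(A) - H^(A|B)."""
    return empirical_entropy(list(a)) - cond_entropy(a, b)

def mmi_metric(x, y, u, v):
    """Generalized MMI score I^(X;Y) - H^(U|V) for a candidate u with codeword x."""
    return mutual_information(x, y) - cond_entropy(u, v)

def clipped_metric(x, y, u, v, R):
    """Alternative score min(R, I^(X;Y)) - H^(U|V)."""
    return min(R, mutual_information(x, y)) - cond_entropy(u, v)
```

A decoder built on this sketch would evaluate the score for every candidate $u$ (with $x = x[u]$) and pick the maximizer; an exhaustive search of this kind is of course feasible only for toy block lengths.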


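As for the alternative universal decoding metric (16), the derivation is, once again, the very
same, with the pairwise error event $A(u,u',v,x,y)$ and the variable $J$ redefined accordingly as the
minimum of $\hat{I}(X';Y)$ s.t. $R \wedge \hat{I}(X';Y) - \hat{H}(U'|V) \ge R \wedge \hat{I}(X;Y) - \hat{H}(U|V)$, and then the last two
inequalities are modified as follows: Instead of (41), we have

$$\begin{aligned}
R \wedge J - \hat{H}(U'|V)
&= R \wedge R \wedge \hat{I}(X';Y) - \hat{H}(U'|V) \\
&\ge R \wedge [R \wedge \hat{I}(X;Y) - \hat{H}(U|V) + \hat{H}(U'|V)] - \hat{H}(U'|V) \\
&= [R - \hat{H}(U'|V)] \wedge \{[R \wedge \hat{I}(X;Y)] - \hat{H}(U|V)\} \\
&\ge [R - \hat{H}(U|V)] \wedge \{[R \wedge \hat{I}(X;Y)] - \hat{H}(U|V)\} \qquad \text{since } \hat{H}(U'|V) \le \hat{H}(U|V) \\
&= R \wedge \hat{I}(X;Y) - \hat{H}(U|V),
\end{aligned} \tag{44}$$

and instead of (42):

$$\begin{aligned}
J - R \wedge \hat{H}(U'|V)
&\ge R \wedge J - R \wedge \hat{H}(U'|V) \\
&\ge R \wedge \hat{I}(X;Y) - \hat{H}(U|V) + \hat{H}(U'|V) - R \wedge \hat{H}(U'|V) \\
&= R \wedge \hat{I}(X;Y) - \hat{H}(U|V) + [\hat{H}(U'|V) - R]_+ \\
&\ge R \wedge \hat{I}(X;Y) - \hat{H}(U|V) + [\hat{H}(U|V) - R]_+ \qquad \text{since } \hat{H}(U'|V) > \hat{H}(U|V) \\
&= R \wedge \hat{I}(X;Y) - R \wedge \hat{H}(U|V) \\
&\ge R \wedge \hat{I}(X;Y) - \hat{H}(U|V).
\end{aligned} \tag{45}$$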
This completes the proof of Theorem 1. 


5 Extensions 

As mentioned in the Introduction, in this section, we provide extensions of the above results in several
directions, including: (i) finite-state sources and channels with LZ universal decoding metrics,
(ii) arbitrary sources and channels with universal decoding w.r.t. a given class of metric decoders,
and (iii) separate source-channel encodings and joint universal decoding of correlated sources. While
in (i) and (ii) we no longer expect to have single-letter formulae for the error exponent, we will 
still be able to propose asymptotically optimum universal decoding metrics in the error exponent 
sense. While the skeleton of the analysis builds upon the one of the proof of Theorem 1, we will 
highlight the non-trivial differences and the modifications needed relative to the proof of Theorem 
1. 

5.1 Finite-State Sources/Channels and a Universal LZ Decoding Metric 

In [33], Ziv considered the class of finite-state channels and proposed a universal decoding metric
that is based on conditional LZ parsing. Here, we discuss a similar model with a suitable extension
of Ziv's decoding metric in the spirit of the generalized MMI decoder.

Consider a sequence of pairs of random variables $\{(U_t, V_t)\}_{t=1}^n$, drawn from a finite-alphabet,
finite-state source, defined according to

$$P(u,v) = \prod_{t=1}^n P(u_t, v_t | s_t), \tag{46}$$

where $s_t$ is the joint state of the two sources at time $t$, which evolves according to

$$s_t = g(s_{t-1}, u_{t-1}, v_{t-1}), \tag{47}$$

with $g : \mathcal{S} \times \mathcal{U} \times \mathcal{V} \to \mathcal{S}$ being the source next-state function, and $\mathcal{S}$ being a finite set of states. The
initial state, $s_1$, is assumed to be an arbitrary fixed member of $\mathcal{S}$. By the same token, the channel
is also assumed to be finite-state (as in [33]), i.e.,

$$W(y|x) = \prod_{t=1}^n W(y_t | x_t, z_t), \qquad z_t = h(z_{t-1}, x_{t-1}, y_{t-1}), \tag{48}$$

where $z_t$ is the channel state at time $t$, taking on values in a finite set $\mathcal{Z}$ and $h : \mathcal{Z} \times \mathcal{X} \times \mathcal{Y} \to \mathcal{Z}$
is the channel next-state function. Once again, the initial state, $z_1$, is an arbitrary member of $\mathcal{Z}$.
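
The product form (46) together with the state recursion (47) translates directly into a single pass over the sequence pair. The sketch below is only an illustration: the function name, the dictionary layout of the emission probabilities and the toy source in the usage note are assumptions, not part of the model definition above. The channel law (48) could be evaluated by the same routine with $(x_t, y_t, z_t, h)$ in place of $(u_t, v_t, s_t, g)$.

```python
from math import log2

def log_prob_finite_state(u, v, emit, g, s1):
    """
    log2 P(u, v) for a finite-state source as in (46)-(47).

    emit[s][(ut, vt)] is P(ut, vt | s); g(s, ut, vt) is the next-state
    function; s1 is the (fixed) initial state.
    """
    s = s1
    total = 0.0
    for ut, vt in zip(u, v):
        total += log2(emit[s][(ut, vt)])  # contribution of P(u_t, v_t | s_t)
        s = g(s, ut, vt)                  # s_{t+1} = g(s_t, u_t, v_t)
    return total

# Hypothetical two-state example: the state is the previous source symbol u.
toy_emit = {
    0: {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125},
    1: {(0, 0): 0.125, (0, 1): 0.125, (1, 0): 0.25, (1, 1): 0.5},
}

def toy_next_state(s, ut, vt):
    return ut
```

With this toy source, the pair $u = (0,1,1)$, $v = (0,1,0)$ and $s_1 = 0$ visits the states $0, 0, 1$ and collects the factors $0.5$, $0.125$ and $0.25$.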

The remaining details of the communication system are the same as described in Subsection 2.2,
with the exception that the random coding distribution, now denoted by $Q(x)$, is allowed here
to be more general than a uniform distribution across a type class (or the uniform distribution
across $\mathcal{X}^n$, as assumed in [33]). Similarly as in [23], we assume that $Q$ may be any exchangeable
probability distribution (i.e., $x'$ is a permutation of $x$ implies $Q(x') = Q(x)$), and that, moreover,
if the state variable $z_t$ includes a component, say, $\sigma_t$, that is fed merely by $\{x_t\}$ (but not $\{y_t\}$),
then it is enough that $Q$ would be invariant within conditional types of $x$ given $\sigma = (\sigma_1, \ldots, \sigma_n)$.

Let $\hat{H}_{\mathrm{LZ}}(x|y)$ denote the normalized conditional LZ compressibility of $x$ given $y$, as defined in
[33, eq. (20)] (and denoted by $u(x,y)$ therein).$^{9}$ Next define

$$\hat{I}_{\mathrm{LZ}}(x;y) = -\frac{1}{n} \log Q(x) - \hat{H}_{\mathrm{LZ}}(x|y), \tag{49}$$

and finally, define the universal decoder

$$\hat{u} = \arg\max_{u} \left[ \hat{I}_{\mathrm{LZ}}(x[u];y) - \hat{H}_{\mathrm{LZ}}(u|v) \right]. \tag{50}$$

Note that the first term on the r.h.s. of (49) plays a role analogous to that of the unconditional
empirical entropy, $\hat{H}_x(X)$, of the memoryless case (and indeed, at least for the uniform distribution
over a type class, as assumed in the previous sections, it is asymptotically equivalent), and so, the
difference in (49) makes sense as an extension of the empirical mutual information between $x$ and
$y$.
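
For readers who wish to experiment with (49) and (50), the following is a rough sketch of an LZ-based conditional compressibility. It rests on assumptions that should be checked against [33]: we take the quantity to have the form $\frac{1}{n}\sum_{l} c_l(x|y) \log c_l(x|y)$, where the pair sequence is parsed incrementally into distinct phrases and $c_l(x|y)$ counts the distinct $x$-phrases that appear jointly with the $l$-th distinct $y$-phrase. The function names are hypothetical, and `lz_mutual_information` further assumes that $Q$ is uniform over $\mathcal{X}^n$, so that $-\frac{1}{n}\log Q(x)$ equals the log of the alphabet size.

```python
from collections import defaultdict
from math import log2

def joint_lz_parse(x, y):
    """Incremental (LZ78-style) parsing of the pair sequence ((x_1,y_1), ...)."""
    seen, phrases, cur = set(), [], []
    for pair in zip(x, y):
        cur.append(pair)
        key = tuple(cur)
        if key not in seen:      # shortest phrase not seen before
            seen.add(key)
            phrases.append(key)
            cur = []
    if cur:                      # last phrase may be incomplete
        phrases.append(tuple(cur))
    return phrases

def conditional_lz(x, y):
    """Sketch of a normalized conditional LZ compressibility of x given y."""
    groups = defaultdict(set)    # distinct y-phrase -> set of joint x-phrases
    for phrase in joint_lz_parse(x, y):
        xs = tuple(a for a, _ in phrase)
        ys = tuple(b for _, b in phrase)
        groups[ys].add(xs)
    return sum(len(g) * log2(len(g)) for g in groups.values()) / len(x)

def lz_mutual_information(x, y, alphabet_size):
    """Analogue of (49) under the assumption of uniform Q over X^n."""
    return log2(alphabet_size) - conditional_lz(x, y)
```

As a toy illustration, `conditional_lz("0101", "0000")` parses the pairs into three phrases, two of which share the same $y$-phrase, giving $(2 \log 2)/4 = 0.5$ bits per symbol. Such short sequences are far from the asymptotic regime in which the LZ quantities are meaningful.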

9 This means that $n\hat{H}_{\mathrm{LZ}}(x|y)$ is the length of the conditional Lempel-Ziv code for $x$, where $y$ serves as SI available
to both the encoder and decoder, which is based on joint incremental parsing of the sequence pair $(x,y)$ (see also
[19]). Here, we are deliberately using a somewhat different notation than the usual, which hopefully makes the
analogy to the memoryless case self-evident.

As in Theorem 1, part b (and as an extension to [33]), we now argue that the universal decoder
(50) achieves an average error probability that is, within a sub-exponential function of $n$, the same
as the average error probability of the MAP decoder for the given source (46) and channel (48).


Theorem 2 Consider the problem setting defined in Subsection 2.2, with a finite-state source (46)
and a finite-state channel (48). Assume that the random binning ensemble is as before and that the
random channel coding distribution $Q$ is as described in the third paragraph of this subsection. Let
$P_e^{\mathrm{MAP}}(n)$ denote the average error probability of the MAP decoder and let $P_e^{u}(n)$ denote the average
error probability of the decoder (50). Then,

$$\lim_{n \to \infty} \frac{1}{n} \log \frac{P_e^{u}(n)}{P_e^{\mathrm{MAP}}(n)} = 0. \tag{51}$$

In other words, similarly as in [33], while we do not have a characterization of the error exponent,
we can still guarantee that whenever the MAP decoder has an exponentially decaying average error
probability, then so does the decoder (50), and with the same exponential rate.


Proof outline. The skeleton of the proof of Theorem 2 is similar to the proof of Theorem 1, but 
as mentioned earlier, some non-trivial modifications are needed. Below we outline the main steps, 
highlighting the main modifications required. 

1. The conditional type class of $u$ given $v$, $T(u|v)$, is redefined as the set $\{u' :\ P(u',v) = P(u,v)\}$,
where $P$ is given as in (46).$^{10}$ Obviously, for every given $v$, the various 'types' $\{T(u|v)\}$ are
equivalence classes, and hence form a partition of $\mathcal{U}^n$. One important property that would be essential
for the proof is that the number $K_n(v)$ of distinct types, $\{T(u|v)\}$, under this definition, for a
given $v$, grows sub-exponentially in $n$ (just like in the case of ordinary types). This guarantees
that the probability of a union of events, over $\{T(u'|v)\}$, is of the same exponential order as the
maximum term, as was the case in the proof of Theorem 1. Interestingly, this can easily be proved
using the theory of LZ data compression:

$$K_n(v) = \sum_{u \in \mathcal{U}^n} \frac{1}{|T(u|v)|} \le \sum_{u \in \mathcal{U}^n} 2^{-n\hat{H}_{\mathrm{LZ}}(u|v) + o(n)} \le 2^{o(n)}, \tag{52}$$

where $o(n)$ stands for a sub-linear term (i.e., $\lim_{n \to \infty} o(n)/n = 0$, uniformly in both $u$ and $v$),
the first inequality is by [33, Lemma 1, p. 459]$^{11}$ and the second inequality is due to the fact that
$n\hat{H}_{\mathrm{LZ}}(u|v)$ is (within negligible terms) a legitimate length function for lossless compression of $u$
(with SI $v$) (see [32, Lemma 2] and [19]) and hence must satisfy the Kraft inequality for every given
$v$.

2. The quantity $\hat{H}(U'|V)$, in the proof of Theorem 1, is replaced by $\frac{1}{n} \log |T(u'|v)|$ with the above
modified definition of the conditional type.

10 Note that the requirement $P(u',v) = P(u,v)$ is imposed here only for the given $P$, not even for every finite-state
source in the class.

11 Not to be confused with the lemma on page 456 of [33], which is also referred to as Lemma 1.

3. The definition of $J$ is changed to

$$J = \min\left\{ -\frac{1}{n} \log Q[T(x'|y)] :\ x' \in A(u,u',v,x,y) \right\}, \tag{53}$$

where $A(u,u',v,x,y)$ is the pairwise error event pertaining to the MAP decoder (for the lower
bound) or to the universal decoder (50) (for the upper bound). By our assumptions, $Q$ assigns the
same probability to all members of $T(x'|y)$, thus

$$\frac{1}{n} \log Q[T(x'|y)] = \frac{1}{n} \log Q(x') + \frac{1}{n} \log |T(x'|y)|. \tag{54}$$


4. Using the above, and following the same steps as in the proof of Theorem 1, the conditional
average error probability, given $(u,v,x,y)$, associated with the MAP decoder, can be shown to be
lower bounded by an expression of the exponential order of $\exp_2\{-nE_0(u,v,x,y)\}$, where

$$\begin{aligned}
E_0(u,v,x,y) &\le \left[ \min\left\{ R,\ -\frac{1}{n} \log Q[T(x|y)] \right\} - \frac{1}{n} \log |T(u|v)| \right]_+ \\
&= \left[ \min\left\{ R,\ -\frac{1}{n} \log Q(x) - \frac{1}{n} \log |T(x|y)| \right\} - \frac{1}{n} \log |T(u|v)| \right]_+ \\
&\le \left[ \min\left\{ R,\ -\frac{1}{n} \log Q(x) - \hat{H}_{\mathrm{LZ}}(x|y) \right\} - \hat{H}_{\mathrm{LZ}}(u|v) \right]_+ + o(1) \\
&= \left[ R \wedge \hat{I}_{\mathrm{LZ}}(x;y) - \hat{H}_{\mathrm{LZ}}(u|v) \right]_+ + o(1) \\
&\triangleq E^*(u,v,x,y),
\end{aligned} \tag{55}$$

and where we have used twice Lemma 1 of [33, p. 459] and the fact that $x \in A(u,u,v,x,y)$ and
so, for $P(u',v) = P(u,v)$, one has $J \le -\frac{1}{n} \log Q[T(x|y)]$.

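5. For the upper bound on the error probability of (50), $A(u,u',v,x,y)$ is replaced by

$$A(u,u',v,x,y) = \{x' :\ \hat{I}_{\mathrm{LZ}}(x';y) - \hat{H}_{\mathrm{LZ}}(u'|v) \ge \hat{I}_{\mathrm{LZ}}(x;y) - \hat{H}_{\mathrm{LZ}}(u|v)\} \tag{56}$$

and accordingly, $J$ is replaced by

$$\begin{aligned}
J &= \min\{\hat{I}_{\mathrm{LZ}}(x';y) :\ \hat{I}_{\mathrm{LZ}}(x';y) - \hat{H}_{\mathrm{LZ}}(u'|v) \ge \hat{I}_{\mathrm{LZ}}(x;y) - \hat{H}_{\mathrm{LZ}}(u|v)\} \\
&= \hat{I}_{\mathrm{LZ}}(x;y) - \hat{H}_{\mathrm{LZ}}(u|v) + \hat{H}_{\mathrm{LZ}}(u'|v).
\end{aligned} \tag{57}$$

6. The indicator function $\mathcal{I}[P(u',v) \ge P(u,v)]$ is replaced by $\mathcal{I}[\hat{H}_{\mathrm{LZ}}(u'|v) \le \hat{H}_{\mathrm{LZ}}(u|v)]$.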

7. For the error probability analysis of the universal decoder (50), the union over erroneous source
vectors $\{u'\}$ is partitioned into (a sub-exponential number of) 'types' of the form

$$T_\ell(u'|v) = \{u :\ P(u,v) = P(u',v),\ n\hat{H}_{\mathrm{LZ}}(u|v) = \ell\}, \tag{58}$$

for $\ell = 1, 2, \ldots$, and one uses the fact that $|T_\ell(u'|v)| \le 2^\ell$, as $n\hat{H}_{\mathrm{LZ}}(\cdot|v)$ is a length function of a
lossless data compression algorithm.

8. It is observed that $\sum_i \mathcal{I}[X(i) \in A(u,u',v,x,y)]$ is a binomial random variable with $2^{nR}$ trials
and probability of success of the exponential order of $2^{-nJ}$. To see why the latter is true, consider
the following, where $\Delta = \hat{I}_{\mathrm{LZ}}(x;y) - \hat{H}_{\mathrm{LZ}}(u|v) + \hat{H}_{\mathrm{LZ}}(u'|v)$ is used as shorthand:

$$\begin{aligned}
Q\{X' \in A(u,u',v,x,y)\}
&= Q\{\hat{I}_{\mathrm{LZ}}(X';y) \ge \Delta\} \\
&= \sum_{\{x' :\ \hat{I}_{\mathrm{LZ}}(x';y) \ge \Delta\}} Q(x') \\
&= \sum_{\{x' :\ \hat{I}_{\mathrm{LZ}}(x';y) \ge \Delta\}} Q(x')\, 2^{n\hat{H}_{\mathrm{LZ}}(x'|y)} \cdot 2^{-n\hat{H}_{\mathrm{LZ}}(x'|y)} \\
&= \sum_{\{x' :\ \hat{I}_{\mathrm{LZ}}(x';y) \ge \Delta\}} \exp_2\{-n\hat{I}_{\mathrm{LZ}}(x';y)\} \cdot 2^{-n\hat{H}_{\mathrm{LZ}}(x'|y)} \\
&\le \sum_{x' \in \mathcal{X}^n} \exp_2\{-n\Delta\} \cdot 2^{-n\hat{H}_{\mathrm{LZ}}(x'|y)} \\
&\le \exp_2\{-n\Delta\} \sum_{x' \in \mathcal{X}^n} 2^{-n\hat{H}_{\mathrm{LZ}}(x'|y)} \\
&\le \exp_2\{-n[\hat{I}_{\mathrm{LZ}}(x;y) - \hat{H}_{\mathrm{LZ}}(u|v) + \hat{H}_{\mathrm{LZ}}(u'|v)] + o(n)\},
\end{aligned} \tag{59}$$

where in the last step, we have used again Kraft's inequality.

9. Using exactly the same method as in the proof of Theorem 1, one can show that the conditional
error probability of the universal decoder (50) is upper bounded by an expression whose exponential
order is lower bounded by $E^*(u,v,x,y)$.

It should be noted that these results continue to apply for arbitrary sources and channels (even
deterministic ones), where the assertion would be that the decoder (50) competes favorably (in the
error exponent sense) relative to any decoding metric of the form

$$\sum_{t=1}^n m_s(u_t, v_t, s_t) + \sum_{t=1}^n m_c(x_t, y_t, z_t), \tag{60}$$

where $s_t$ and $z_t$ evolve according to the next-state functions $g$ and $h$, respectively, as defined above. This follows from
the observation that the assumption on underlying finite-state sources and finite-state channels
was actually used merely in the assumed structure of the MAP decoding metric, with which decoder
(50) competes. The fact that the overall probability of error is eventually averaged over all source
vectors and channel noise realizations pertaining to finite-state probability distributions, was not
really used here, since we compared the conditional error probabilities given $(u,v,x,y)$. The same
observation has been exploited also in [25] for universal pure channel coding, and it will be further
developed in the next subsection.

5.2 Arbitrary Sources and Channels With a Given Class of Metric Decoders


In [25], the following setting of universal channel decoding was studied: Given a random coding
distribution $Q$, for independent random selection of $2^{nR}$ codewords $\{x_i\}$, and given a limited class
of reference decoders, defined by a family of decoding metrics $\{m_\theta(x,y),\ \theta \in \Theta\}$ ($\theta$ being an index
or a parameter), find a decoding metric that is universal in the sense of achieving an average error
probability that is, within a sub-exponential function of $n$, as good as the best decoder in the class,
no matter what the underlying channel, $W(y|x)$, may be. The following decoder was shown in
[25] to possess this property under a certain condition that will be specified shortly: estimate the
message $i$ as the one that minimizes $u(x_i,y) = \log Q[T(x_i|y)]$, where $T(x|y)$ designates a notion
of a "type" induced by the family of decoding metrics (rather than by channels), namely,

$$T(x|y) = \{x' :\ m_\theta(x',y) = m_\theta(x,y)\ \ \forall\, \theta \in \Theta\}. \tag{61}$$

As $\{T(x|y)\}$ are equivalence classes, they form a partition of $\mathcal{X}^n$ for every given $y$. The condition
required for the universality of this decoding metric is that the number of distinct 'types' $\{T(x|y)\}$
would grow sub-exponentially with $n$.
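
For very short blocks, the metric-induced type class (61) and the score $\log Q[T(x_i|y)]$ can be computed by brute-force enumeration, which may help in building intuition. The sketch below is illustrative only: the function names are hypothetical, $Q$ is assumed uniform over $\mathcal{X}^n$, and the single Hamming-distance metric in the usage note is merely an example of a metric family, not one taken from [25].

```python
from itertools import product
from math import log2

def type_class(x, y, metrics, alphabet):
    """All x' with m(x', y) == m(x, y) for every metric m in the family, as in (61)."""
    target = tuple(m(x, y) for m in metrics)
    return [xp for xp in product(alphabet, repeat=len(x))
            if tuple(m(xp, y) for m in metrics) == target]

def log_q_of_type(x, y, metrics, alphabet):
    """log2 Q[T(x|y)] under the assumption that Q is uniform over X^n."""
    size = len(type_class(x, y, metrics, alphabet))
    return log2(size) - len(x) * log2(len(alphabet))

def decode(codebook, y, metrics, alphabet):
    """Index of the codeword minimizing log Q[T(x_i|y)]."""
    scores = [log_q_of_type(x, y, metrics, alphabet) for x in codebook]
    return scores.index(min(scores))

def hamming(x, y):
    """Example metric: Hamming distance between x and y."""
    return sum(a != b for a, b in zip(x, y))
```

With the family consisting of `hamming` alone, binary alphabet and $n = 4$, the type class of a word at distance $2$ from $y$ contains $\binom{4}{2} = 6$ words, while a word at distance $0$ forms a type class of size one.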

A similar approach can be taken in the present problem setting. Given a family of decoding
metrics of the form$^{12}$

$$m_\theta(u,v,x,y) = m_{s,\theta}(u,v) + m_{c,\theta}(x,y), \qquad \theta \in \Theta, \tag{62}$$

let us define

$$T_s(u|v) = \{u' :\ m_{s,\theta}(u',v) = m_{s,\theta}(u,v)\ \ \forall\, \theta \in \Theta\} \tag{63}$$

$$T_c(x|y) = \{x' :\ m_{c,\theta}(x',y) = m_{c,\theta}(x,y)\ \ \forall\, \theta \in \Theta\}, \tag{64}$$

and assume, as before, that the numbers of distinct 'types', $\{T_s(u|v)\}$ and $\{T_c(x|y)\}$, both grow
sub-exponentially with $n$. Then, the universal decoder

$$\hat{u} = \arg\min_{u} \{\log |T_s(u|v)| + \log Q[T_c(x[u]|y)]\} \tag{65}$$

competes favorably with all metrics in the above family, no matter what the underlying source and
the underlying channel may be. The proof combines the ideas of the proof of Theorem 1 above with
those of [25], with the proper adjustments, of course, but it is otherwise straightforward. Here,
the term $\log |T_s(u|v)|$ is the analogue of $n$ times the conditional empirical entropy pertaining to
the source part, whereas the term $\log Q[T_c(x[u]|y)]$ plays the role of $n$ times the negative empirical
mutual information between $x[u]$ and $y$. Therefore if, for example,

$$m_{c,\theta}(x,y) = \sum_{t=1}^n m_{c,\theta}(x_t, y_t) \quad \text{and} \tag{66}$$

$$m_{s,\theta}(u,v) = \sum_{t=1}^n m_{s,\theta}(u_t, v_t), \tag{67}$$

as is the case when the sources and the channel are memoryless, then $\{T_c(x|y)\}$ and $\{T_s(u|v)\}$
become conditional type classes in the usual sense, and we are back to the generalized MMI decoder
of Section 3, provided that $Q$ is, again, the uniform distribution within a single type class. As a
final note, in this context, we mention that in this setting, the input and the output alphabets of
the channel may also be continuous, see, e.g., [22, p. 5575, Example 3].

12 This additive structure can be justified by the fact that the MAP decoding metric is also additive, as it maximizes
$\log P(u,v) + \log W(y|x[u])$.

5.3 Separate Encodings and Universal Joint Decoding of Correlated Sources 

Consider the system depicted in Fig. 2, which illustrates a scenario of separate source-channel
encodings and joint decoding of two correlated sources, $u_1$ and $u_2$. For the sake of simplicity of
the presentation, we return to the assumption of memoryless systems, as in Section 3.



Figure 2: Separate source-channel encodings and joint decoding of two correlated sources. 

Consider $n$ independent copies $\{(U_{1,t}, U_{2,t})\}_{t=1}^n$ of a finite-alphabet pair of random variables
$(U_1, U_2) \sim P_{U_1U_2}$, as well as $n$ uses of two independent finite-alphabet DMC's, $W_1(y_1|x_1) = \prod_{t=1}^n W_1(y_{1,t}|x_{1,t})$
and $W_2(y_2|x_2) = \prod_{t=1}^n W_2(y_{2,t}|x_{2,t})$. For $k = 1, 2$, consider the following mechanism:
The source vector $u_k = (u_{k,1}, \ldots, u_{k,n})$ is encoded into one out of $M_k = 2^{nR_k}$ bins, selected
independently at random for every member of $\mathcal{U}_k^n$. The bin index $j_k = f_k(u_k)$ is in turn mapped
into a channel input vector $x_k(j_k) \in \mathcal{X}_k^n$, which is transmitted across the channel $W_k$. The various
codewords $\{x_k(i)\}_{i=1}^{M_k}$ are selected independently at random under the uniform distribution within
given type classes $T(Q_k)$, where $Q_k$ is a given distribution across $\mathcal{X}_k$. The randomly chosen codebook
$\{x_k(1), x_k(2), \ldots, x_k(M_k)\}$ will be denoted by $\mathcal{C}_k$. Similarly as before, we will sometimes
denote $x_k(j_k) = x_k[f_k(u_k)]$ by $x_k[u_k]$. The optimal (MAP) decoder estimates $(u_1, u_2)$, using the
channel outputs $y_1$ and $y_2$, according to

$$(\hat{u}_1, \hat{u}_2) = \arg\max_{u_1, u_2} P(u_1, u_2)\, W_1(y_1|x_1[u_1])\, W_2(y_2|x_2[u_2]). \tag{68}$$

The main structure of the analysis continues to be essentially the same as in Section 4. The
situation here, however, is significantly more involved, because five different types of pairwise error
events $\{(u_1, u_2) \to (u_1', u_2')\}$ should be carefully handled:

1. $u_1' \ne u_1$ and $u_2' = u_2$.

2. $u_2' \ne u_2$ and $u_1' = u_1$.

3. Both $u_1' \ne u_1$ and $u_2' \ne u_2$, but (at least) $u_2'$ is mapped into the same bin as $u_2$.

4. Both $u_1' \ne u_1$ and $u_2' \ne u_2$, but (at least$^{13}$) $u_1'$ is mapped into the same bin as $u_1$.

5. Both $u_1' \ne u_1$ and $u_2' \ne u_2$, and neither $u_1'$ nor $u_2'$ belongs to the same bin as the respective
true source vector.

Errors of types 1 and 2 are of the same nature as in Section 3, where the source that is estimated
correctly is actually in the role of SI at the decoder. Following (16), the respective metrics are$^{14}$

$$f_1(u_1,u_2,x_1,x_2,y_1,y_2) = R_1 \wedge \hat{I}(X_1;Y_1) - \hat{H}(U_1|U_2) \tag{69}$$

$$f_2(u_1,u_2,x_1,x_2,y_1,y_2) = R_2 \wedge \hat{I}(X_2;Y_2) - \hat{H}(U_2|U_1), \tag{70}$$

where $\hat{I}(X_1;Y_1)$ and $\hat{H}(U_1|U_2)$ are shorthand notations for $\hat{I}_{x_1y_1}(X_1;Y_1)$ and $\hat{H}_{u_1u_2}(U_1|U_2)$,
respectively, and so on. Errors of types 3 and 4 will turn out to be addressed by metrics of the
form

$$f_3(u_1,u_2,x_1,x_2,y_1,y_2) = R_1 \wedge \hat{I}(X_1;Y_1) + R_2 - \hat{H}(U_1,U_2) \tag{71}$$

$$f_4(u_1,u_2,x_1,x_2,y_1,y_2) = R_2 \wedge \hat{I}(X_2;Y_2) + R_1 - \hat{H}(U_1,U_2). \tag{72}$$

Finally, error of type 5 is accommodated by

$$\begin{aligned}
f_5(u_1,u_2,x_1,x_2,y_1,y_2) &= \hat{I}(X_1;Y_1) + \hat{I}(X_2;Y_2) - \min\{\hat{H}(U_1,U_2),\ R_1 \wedge \hat{I}(X_1;Y_1) + R_2 \wedge \hat{I}(X_2;Y_2)\} \\
&= [R_1 \wedge \hat{I}(X_1;Y_1) + R_2 \wedge \hat{I}(X_2;Y_2) - \hat{H}(U_1,U_2)]_+ + [\hat{I}(X_1;Y_1) - R_1]_+ + [\hat{I}(X_2;Y_2) - R_2]_+.
\end{aligned} \tag{73}$$

But we need a single universal decoding metric that copes with all five types of errors at the same
time.

Similarly as in [25, eqs. (57)-(60)], this objective is accomplished by a metric which is given by
the minimum among all five metrics above, i.e., we define our decoding metric as

$$f_0(u_1,u_2,x_1,x_2,y_1,y_2) = \min_{1 \le i \le 5} f_i(u_1,u_2,x_1,x_2,y_1,y_2). \tag{74}$$
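
Once the empirical quantities are available, the five metrics and their minimum are elementary to evaluate. The sketch below is an illustration with hypothetical function and argument names; it takes the rates and the empirical informational quantities as plain numbers, and it also includes the second form of $f_5$ from (73) so that the equality of the two forms can be checked numerically.

```python
def pos(a):
    """The operator [a]_+ = max(a, 0)."""
    return max(a, 0.0)

def decoding_metrics(R1, R2, I1, I2, H1g2, H2g1, H12):
    """
    Metrics f_1..f_5 of (69)-(73).

    I1, I2: empirical mutual informations I^(X1;Y1), I^(X2;Y2);
    H1g2, H2g1: empirical conditional entropies H^(U1|U2), H^(U2|U1);
    H12: empirical joint entropy H^(U1,U2).
    """
    f1 = min(R1, I1) - H1g2
    f2 = min(R2, I2) - H2g1
    f3 = min(R1, I1) + R2 - H12
    f4 = min(R2, I2) + R1 - H12
    f5 = I1 + I2 - min(H12, min(R1, I1) + min(R2, I2))
    return [f1, f2, f3, f4, f5]

def f5_second_form(R1, R2, I1, I2, H12):
    """Second line of (73)."""
    return pos(min(R1, I1) + min(R2, I2) - H12) + pos(I1 - R1) + pos(I2 - R2)

def f0(R1, R2, I1, I2, H1g2, H2g1, H12):
    """The overall decoding metric (74): the minimum of the five metrics."""
    return min(decoding_metrics(R1, R2, I1, I2, H1g2, H2g1, H12))
```

For example, with the illustrative values $R_1 = 1$, $R_2 = 0.5$, $\hat{I}(X_1;Y_1) = \hat{I}(X_2;Y_2) = 0.75$, $\hat{H}(U_1|U_2) = \hat{H}(U_2|U_1) = 0.25$ and $\hat{H}(U_1,U_2) = 1$, the five metrics are $0.5, 0.25, 0.25, 0.5, 0.5$, so the minimum is $0.25$, and both forms of $f_5$ agree.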

Our main result in this subsection is the following.

13 Here, we are counting twice the case "$u_1' \ne u_1$ and $u_2' \ne u_2$ and both estimates are in the bins of their respective
true source vectors." This is done simply for symmetry of the structure above, without affecting the error exponent.
14 Note that $f_1$ does not really depend on $(x_2, y_2)$, and similarly, $f_2$ does not depend on $(x_1, y_1)$. Nonetheless, we
deliberately adopt this uniform notation for convenience later on.

Theorem 3 Consider the above described setting of separate encodings and joint decoding of two
correlated memoryless sources transmitted over two respective, independent memoryless channels.
Then, the universal decoder

$$(\hat{u}_1, \hat{u}_2) = \arg\max_{u_1, u_2} f_0(u_1, u_2, x_1[u_1], x_2[u_2], y_1, y_2) \tag{75}$$

achieves the same random-binning/random-coding error exponent as the MAP decoder (68).

Proof outline. The conditional probability of error given $(u_1, u_2, x_1, x_2, y_1, y_2)$, for both the
MAP decoder and the universal decoder, can be shown to be of the exponential order of

$$\exp_2\{-n[f_0(u_1, u_2, x_1, x_2, y_1, y_2)]_+\}.$$

To show this, the analysis of the probability of error, for both the MAP decoder and the universal
decoder, should be divided into several parts, according to the various types of error events. Errors
of types 1 and 2 are addressed exactly as in Section 4. The more complicated part of the analysis is
due to errors of types 3-5, where both competing source vectors are in error. However, this analysis
too follows the same basic ideas. Here we will outline only the main ingredients that are different
from those of the proof of Theorem 1.

For a given $u_1' \ne u_1$ and $u_2' \ne u_2$ (errors of types 3-5), let us define the pairwise error event

$$\begin{aligned}
& A(u_1,u_1',u_2,u_2',x_1,x_2,y_1,y_2) \\
&= [T(Q_1) \times T(Q_2)] \cap \{(x_1', x_2') :\ P(u_1',u_2')W_1(y_1|x_1')W_2(y_2|x_2') \ge P(u_1,u_2)W_1(y_1|x_1)W_2(y_2|x_2)\}.
\end{aligned}$$

The conditional error event, given $(u_1, u_2, x_1, x_2, y_1, y_2, \mathcal{C}_1, \mathcal{C}_2)$, is given by

$$\begin{aligned}
& \mathcal{E}(u_1,u_2,x_1,x_2,y_1,y_2,\mathcal{C}_1,\mathcal{C}_2) \\
&= \bigcup_{u_1' \ne u_1,\ u_2' \ne u_2} \{P(u_1',u_2')W_1(y_1|x_1[u_1'])W_2(y_2|x_2[u_2']) \ge P(u_1,u_2)W_1(y_1|x_1[u_1])W_2(y_2|x_2[u_2])\} \\
&= \bigcup_{u_1' \ne u_1,\ u_2' \ne u_2} \mathcal{E}(u_1,u_1',u_2,u_2',x_1,x_2,y_1,y_2,\mathcal{C}_1,\mathcal{C}_2).
\end{aligned} \tag{76}$$

Here too, the exponential tightness of the truncated union bound for two-dimensional unions of
events with independence structure as above can be established using de Caen's lower bound [5]
(see [20]). For errors of types 3 and 4, let us define

$$A_1(u_1,u_1',u_2,u_2',x_1,y_1) = T(Q_1) \cap \{x_1' :\ P(u_1',u_2')W_1(y_1|x_1') \ge P(u_1,u_2)W_1(y_1|x_1)\} \tag{77}$$

and

$$A_2(u_1,u_1',u_2,u_2',x_2,y_2) = T(Q_2) \cap \{x_2' :\ P(u_1',u_2')W_2(y_2|x_2') \ge P(u_1,u_2)W_2(y_2|x_2)\}. \tag{78}$$
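The probability of $\mathcal{E}(\mathbf{u}_1,\mathbf{u}_1',\mathbf{u}_2,\mathbf{u}_2',\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2,\mathcal{C}_1,\mathcal{C}_2)$ (w.r.t. the randomness of the bin assignment) is given by:

$$\begin{aligned}
\Pr\{\mathcal{E}(\mathbf{u}_1,\mathbf{u}_1',\mathbf{u}_2,\mathbf{u}_2',\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2,\mathcal{C}_1,\mathcal{C}_2)\}
&= 2^{-n(R_1+R_2)}\left|[\mathcal{C}_1\times\mathcal{C}_2]\cap\mathcal{A}(\mathbf{u}_1,\mathbf{u}_1',\mathbf{u}_2,\mathbf{u}_2',\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2)\right|\\
&\quad+ 2^{-n(R_1+R_2)}\left|\mathcal{C}_1\cap\mathcal{A}_1(\mathbf{u}_1,\mathbf{u}_1',\mathbf{u}_2,\mathbf{u}_2',\mathbf{x}_1,\mathbf{y}_1)\right|\\
&\quad+ 2^{-n(R_1+R_2)}\left|\mathcal{C}_2\cap\mathcal{A}_2(\mathbf{u}_1,\mathbf{u}_1',\mathbf{u}_2,\mathbf{u}_2',\mathbf{x}_2,\mathbf{y}_2)\right|\\
&\quad+ 2^{-n(R_1+R_2)}\,\mathcal{I}\{P(\mathbf{u}_1',\mathbf{u}_2')\ge P(\mathbf{u}_1,\mathbf{u}_2)\},
\end{aligned} \tag{79}$$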


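where the first term stands for errors of type 5, the second and third terms represent errors of types 3 and 4, and the last term is associated with an error where both $\mathbf{u}_1'\neq\mathbf{u}_1$ and $\mathbf{u}_2'\neq\mathbf{u}_2$, but the respective bins both coincide. Passing temporarily to shorthand notation, let us denote

$$\left|[\mathcal{C}_1\times\mathcal{C}_2]\cap\mathcal{A}\right| + \left|\mathcal{C}_1\cap\mathcal{A}_1\right| + \left|\mathcal{C}_2\cap\mathcal{A}_2\right| + \mathcal{I}\{P(\mathbf{u}_1',\mathbf{u}_2')\ge P(\mathbf{u}_1,\mathbf{u}_2)\} = N_{12}+N_1+N_2+I. \tag{80}$$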


The next step, as before, is to average over the randomness of all codewords in $\mathcal{C}_1$ and $\mathcal{C}_2$. To analyze the large deviations behavior of $N_{12}+N_1+N_2$, the contributions of the individual random variables can be handled separately, since $\Pr\{N_{12}+N_1+N_2\ge\text{threshold}\}$ is of the same exponential order as the sum

$$\Pr\{N_{12}\ge\text{threshold}\}+\Pr\{N_1\ge\text{threshold}\}+\Pr\{N_2\ge\text{threshold}\}.$$

Now, $N_1$ and $N_2$ are binomial random variables whose numbers of trials are $2^{nR_1}$ and $2^{nR_2}$, respectively, and whose probabilities of success decay exponentially according to the relevant channel mutual informations, as before. So their contributions are again analyzed in much the same way as those of type 1 and type 2 errors.
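The separation of the three contributions rests on a simple sandwich argument, sketched here under the assumption (ours, not stated explicitly above) that the threshold $t$ is positive and of exponential form in $n$. Since $N_{12}$, $N_1$ and $N_2$ are non-negative,

$$\max\{\Pr\{N_{12}\ge t\},\Pr\{N_1\ge t\},\Pr\{N_2\ge t\}\}\ \le\ \Pr\{N_{12}+N_1+N_2\ge t\}\ \le\ \Pr\{N_{12}\ge t/3\}+\Pr\{N_1\ge t/3\}+\Pr\{N_2\ge t/3\},$$

where the upper bound holds because a sum of three terms can reach $t$ only if at least one of them reaches $t/3$. Dividing an exponentially growing threshold by 3 does not change its exponential order, and for integer-valued counts the events $\{N\ge 1/3\}$ and $\{N\ge 1\}$ coincide, so both bounds should have the same exponential order as the sum of the three individual probabilities.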

Finally, it remains to handle $N_{12}$, which is not a binomial random variable, but it can be decomposed as the sum (over combinations of conditional types of $\mathbf{x}_1'$ given $\mathbf{y}_1$ and of $\mathbf{x}_2'$ given $\mathbf{y}_2$) of products of independent binomial random variables, for which we reuse the notations $N_1$ and $N_2$ (for a given combination of such types). Using the same techniques as in [5T, Chap. 6], one can easily obtain the following generic result concerning the large deviations behavior of $N_1\cdot N_2$: If $N_1$ is a binomial random variable with $2^{nA_1}$ trials and probability of success $2^{-nB_1}$, and $N_2$ is an independent binomial random variable with $2^{nA_2}$ trials and probability of success $2^{-nB_2}$, then

$$\Pr\{N_1\cdot N_2\ge 2^{nC}\}\doteq\max_{0\le a\le C}\Pr\{N_1\ge 2^{na}\}\cdot\Pr\{N_2\ge 2^{n(C-a)}\}\doteq 2^{-nE} \tag{81}$$

with

$$E=\begin{cases}[B_1-A_1]_+ + [B_2-A_2]_+ & C\le [A_1-B_1]_+ + [A_2-B_2]_+\\ \infty & C> [A_1-B_1]_+ + [A_2-B_2]_+\end{cases} \tag{82}$$


Using this fact, it is possible to obtain the contribution of the type 5 error event. 
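As an illustration of the generic result in (81) and (82), the following minimal Python sketch evaluates the exponent $E$ in two ways: directly from the closed form, and by a grid search over the split parameter $a$. The function names, the grid resolution, and the single-variable exponent used inside the grid search (finite and equal to $[B-A]_+$ when the level does not exceed $[A-B]_+$, infinite otherwise) are our own assumptions for illustration, not taken from the text.

```python
import math


def pos(x: float) -> float:
    """Positive part [x]_+ = max(x, 0)."""
    return max(x, 0.0)


def product_exponent(a1: float, b1: float, a2: float, b2: float, c: float) -> float:
    """Closed-form exponent E of Pr{N1*N2 >= 2^(nC)}, as in (82)."""
    if c <= pos(a1 - b1) + pos(a2 - b2):
        return pos(b1 - a1) + pos(b2 - a2)
    return math.inf


def single_exponent(a: float, b: float, level: float) -> float:
    """Assumed exponent of Pr{N >= 2^(n*level)} for a binomial N with
    2^(nA) trials and success probability 2^(-nB)."""
    return pos(b - a) if level <= pos(a - b) else math.inf


def product_exponent_by_split(a1, b1, a2, b2, c, steps: int = 1000) -> float:
    """Exponent obtained from the maximization over 0 <= a <= C in (81),
    approximated on a uniform grid (the endpoint a = C is hit exactly)."""
    best = math.inf
    for k in range(steps + 1):
        a = c if k == steps else c * k / steps
        total = single_exponent(a1, b1, a) + single_exponent(a2, b2, c - a)
        best = min(best, total)
    return best


if __name__ == "__main__":
    for args in [(1.0, 0.5, 0.6, 0.2, 0.3),
                 (0.2, 0.5, 0.1, 0.4, 0.0),
                 (0.2, 0.5, 0.1, 0.4, 1.0)]:
        print(args, product_exponent(*args), product_exponent_by_split(*args))
```

The grid search is only a sanity check: a maximum of probabilities corresponds to a minimum of exponents, and a coarse grid can miss a feasible split when the feasible interval for $a$ is very narrow.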

Upon carrying out the analysis along these lines, the state of affairs turns out to be as described next. In the analysis of the conditional probability of error, the contribution of a given type class, $\mathcal{T}(\mathbf{u}_1',\mathbf{u}_2')$, of competing source vectors, which are encoded into $\mathbf{x}_1'$ and $\mathbf{x}_2'$ (from given conditional type classes given $\mathbf{y}_1$ and $\mathbf{y}_2$, respectively) is the following: the probability of error of type $i$ is of the exponential order of $\exp\{-n[f_i(\mathbf{u}_1',\mathbf{u}_2',\mathbf{x}_1',\mathbf{x}_2',\mathbf{y}_1,\mathbf{y}_2)]_+\}$, $i=1,\ldots,5$. Thus, the total conditional error probability contributed by this combination of types is of the exponential order of

$$\begin{aligned}
\sum_{i=1}^{5}\exp\{-n[f_i(\mathbf{u}_1',\mathbf{u}_2',\mathbf{x}_1',\mathbf{x}_2',\mathbf{y}_1,\mathbf{y}_2)]_+\}
&\doteq \exp\{-n\min_i[f_i(\mathbf{u}_1',\mathbf{u}_2',\mathbf{x}_1',\mathbf{x}_2',\mathbf{y}_1,\mathbf{y}_2)]_+\}\\
&= \exp\{-n[f_0(\mathbf{u}_1',\mathbf{u}_2',\mathbf{x}_1',\mathbf{x}_2',\mathbf{y}_1,\mathbf{y}_2)]_+\}.
\end{aligned} \tag{83}$$

For the total contribution of all type classes, the exponent $[f_0(\mathbf{u}_1',\mathbf{u}_2',\mathbf{x}_1',\mathbf{x}_2',\mathbf{y}_1,\mathbf{y}_2)]_+$ should be minimized over all such combinations of types (that yield the relevant pairwise error event). An upper bound on this exponent is obtained by selecting the same combination of types as those of the correct source vectors (instead of taking this minimum), namely, the conditional error probability of the MAP decoder is simply lower bounded by the exponential order of $\exp\{-n[f_0(\mathbf{u}_1,\mathbf{u}_2,\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2)]_+\}$. As for the universal decoder, one should minimize the exponent $[f_0(\mathbf{u}_1',\mathbf{u}_2',\mathbf{x}_1',\mathbf{x}_2',\mathbf{y}_1,\mathbf{y}_2)]_+$ as well, but only over the combinations of type classes that are associated with the pairwise error event of this decoder, namely, those for which $f_0(\mathbf{u}_1',\mathbf{u}_2',\mathbf{x}_1',\mathbf{x}_2',\mathbf{y}_1,\mathbf{y}_2)\ge f_0(\mathbf{u}_1,\mathbf{u}_2,\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2)$. However, this minimum is exactly $[f_0(\mathbf{u}_1,\mathbf{u}_2,\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2)]_+$, which agrees with that of the upper bound associated with the MAP decoder.

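More formally, expressing $[f_0(\mathbf{u}_1,\mathbf{u}_2,\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2)]_+$ as a functional of the relevant joint empirical distributions, i.e.,

$$F(P_{\mathbf{u}_1\mathbf{u}_2},P_{\mathbf{x}_1\mathbf{y}_1},P_{\mathbf{x}_2\mathbf{y}_2}) = [f_0(\mathbf{u}_1,\mathbf{u}_2,\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2)]_+, \tag{84}$$

the error exponent achieved by both the MAP decoder and the universal decoder is given by

$$E(R_1,R_2,Q_1,Q_2) = \min_{P_{U_1'U_2'},W_1',W_2'}\left\{D(P_{U_1'U_2'}\|P_{U_1U_2}) + D(W_1'\|W_1|Q_1) + D(W_2'\|W_2|Q_2) + F(P_{U_1'U_2'},Q_1\times W_1',Q_2\times W_2')\right\}. \tag{85}$$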


Note that when $R_1$ and $R_2$ are sufficiently large, neither $f_3$ nor $f_4$ would achieve $f_0$. At the same time, $f_1$, $f_2$ and $f_5$ degenerate as follows:

$$f_1(\mathbf{u}_1,\mathbf{u}_2,\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2) = I(X_1;Y_1) - H(U_1|U_2) \tag{86}$$

$$f_2(\mathbf{u}_1,\mathbf{u}_2,\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2) = I(X_2;Y_2) - H(U_2|U_1) \tag{87}$$

$$f_5(\mathbf{u}_1,\mathbf{u}_2,\mathbf{x}_1,\mathbf{x}_2,\mathbf{y}_1,\mathbf{y}_2) = [I(X_1;Y_1) + I(X_2;Y_2) - H(U_1,U_2)]_+. \tag{88}$$


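Therefore, we have

$$\begin{aligned}
E(\infty,\infty,Q_1,Q_2) = \min_{P_{U_1'U_2'},W_1',W_2'}\Big\{& D(P_{U_1'U_2'}\|P_{U_1U_2}) + D(W_1'\|W_1|Q_1) + D(W_2'\|W_2|Q_2)\\
&+ \big[\min\{I(X_1;Y_1') - H(U_1'|U_2'),\ I(X_2;Y_2') - H(U_2'|U_1'),\\
&\qquad\quad I(X_1;Y_1') + I(X_2;Y_2') - H(U_1',U_2')\}\big]_+\Big\}.
\end{aligned} \tag{89}$$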


Further restricting this to the case of noiseless bit-pipes at fixed transmission rates $r_1$ and $r_2$, respectively, the above channel-related divergence terms vanish, and one obtains the error exponent of separate compression and joint decompression of correlated sources

$$\min_{P_{U_1'U_2'}}\left\{D(P_{U_1'U_2'}\|P_{U_1U_2}) + \big[\min\{r_1 - H(U_1'|U_2'),\ r_2 - H(U_2'|U_1'),\ r_1+r_2 - H(U_1',U_2')\}\big]_+\right\} \tag{90}$$


in agreement with [U Exercise 13.5] (second edition). 
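To make the structure of (90) concrete, here is a minimal Python sketch that evaluates the objective inside the minimization for a given candidate joint distribution, and then approximates the minimum by a coarse grid search for a pair of binary sources. Everything here is an illustrative assumption on our part: the function names, the use of base-2 logarithms (rates in bits), the positive-part clipping of the inner minimum, and the grid search itself, which only yields an upper bound on the true minimum.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl(p, q):
    """Divergence D(p||q) in bits; infinite if q vanishes where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log2(a / b)
    return total


def sw_objective(p_alt, p_src, r1, r2):
    """Objective of (90) for a candidate joint pmf p_alt (2-D list) against
    the source pmf p_src, at rates r1 and r2."""
    flat_alt = [x for row in p_alt for x in row]
    flat_src = [x for row in p_src for x in row]
    h12 = entropy(flat_alt)                          # H(U1', U2')
    h1 = entropy([sum(row) for row in p_alt])        # H(U1')
    h2 = entropy([sum(col) for col in zip(*p_alt)])  # H(U2')
    slack = min(r1 - (h12 - h2), r2 - (h12 - h1), r1 + r2 - h12)
    return kl(flat_alt, flat_src) + max(slack, 0.0)


def sw_exponent_grid(p_src, r1, r2, steps=20):
    """Grid approximation (upper bound) of the minimum in (90) for binary sources."""
    best = math.inf
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            for k in range(steps + 1 - i - j):
                m = steps - i - j - k
                p_alt = [[i / steps, j / steps], [k / steps, m / steps]]
                best = min(best, sw_objective(p_alt, p_src, r1, r2))
    return best


if __name__ == "__main__":
    source = [[0.4, 0.1], [0.1, 0.4]]
    print(sw_exponent_grid(source, 1.0, 1.0))
    print(sw_exponent_grid(source, 0.5, 0.5))
```

In this sketch, choosing the candidate distribution equal to the source distribution makes the divergence term vanish, so the grid value is zero whenever the rate pair lies outside the region where all three slack terms are positive at the source distribution.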


Appendix - Proof of Eq. (Ej) 


First, observe that

$$(1+\rho)\ln\left(\sum_{u\in\mathcal{U}}[P(u)]^{1/(1+\rho)}\right) = \max_{P'}\,[\rho\mathcal{H}(P') - D(P'\|P)], \tag{A.1}$$


as can easily be seen by solving explicitly the maximization on the r.h.s. Next observe that for $\rho>0$, the maximizer $P_\rho$ is always associated with an entropy larger than $\mathcal{H}(P)=H(U)$ (see footnote 15). Therefore, the above identity can be further developed to obtain


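$$\begin{aligned}
(1+\rho)\ln\left(\sum_{u\in\mathcal{U}}[P(u)]^{1/(1+\rho)}\right)
&= \max_{\{P':\ \mathcal{H}(P)\le\mathcal{H}(P')\le\log|\mathcal{U}|\}}[\rho\mathcal{H}(P') - D(P'\|P)]\\
&= \max_{\mathcal{H}(P)\le R\le\log|\mathcal{U}|}\ \max_{\{P':\ \mathcal{H}(P')=R\}}[\rho\mathcal{H}(P') - D(P'\|P)]\\
&= \max_{\mathcal{H}(P)\le R\le\log|\mathcal{U}|}\ \max_{\{P':\ \mathcal{H}(P')=R\}}[\rho R - D(P'\|P)]\\
&= \max_{\mathcal{H}(P)\le R\le\log|\mathcal{U}|}\left[\rho R - \min_{\{P':\ \mathcal{H}(P')=R\}}D(P'\|P)\right]\\
&= \max_{\mathcal{H}(P)\le R\le\log|\mathcal{U}|}[\rho R - E_s(R)],
\end{aligned} \tag{A.2}$$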


15 To see why this is true, note that $\rho\mathcal{H}(P)\le\max_{P'}[\rho\mathcal{H}(P') - D(P'\|P)] = \rho\mathcal{H}(P_\rho) - D(P_\rho\|P)\le\rho\mathcal{H}(P_\rho)$.
where the last step follows from the fact that, due to the convexity and the monotonicity of the source coding exponent function, the constraint $\mathcal{H}(P')\ge R$, of the minimization of $D(P'\|P)$ that defines it, is attained with equality in the range $R\in[\mathcal{H}(P),\log|\mathcal{U}|]$ (see also [2, eq. (7)]). Therefore,


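$$\begin{aligned}
\max_{0\le\rho\le 1}\left\{E_0(\rho,Q) - (1+\rho)\ln\left(\sum_{u\in\mathcal{U}}[P(u)]^{1/(1+\rho)}\right)\right\}
&= \max_{0\le\rho\le 1}\left\{E_0(\rho,Q) - \max_{\mathcal{H}(P)\le R\le\log|\mathcal{U}|}[\rho R - E_s(R)]\right\}\\
&= \max_{0\le\rho\le 1}\ \min_{\mathcal{H}(P)\le R\le\log|\mathcal{U}|}[E_0(\rho,Q) - \rho R + E_s(R)]\\
&= \min_{\mathcal{H}(P)\le R\le\log|\mathcal{U}|}\ \max_{0\le\rho\le 1}[E_0(\rho,Q) - \rho R + E_s(R)]\\
&= \min_{\mathcal{H}(P)\le R\le\log|\mathcal{U}|}[E_r(R,Q) + E_s(R)],
\end{aligned} \tag{A.3}$$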


where the interchange of the maximization and the minimization in the second to the last step is allowed by the concavity of $E_0(\rho,Q)$ in $\rho$ [9, eq. (5.6.26)] and the convexity of $E_s(R)$ [2].
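The identity (A.1), on which the chain of equalities above is built, can be checked numerically. The short Python sketch below does so for a small alphabet, assuming natural logarithms throughout and assuming that the maximizer is the tilted distribution proportional to $[P(u)]^{1/(1+\rho)}$ (which is what one gets by solving the maximization explicitly). The function names and the test distribution are ours, chosen only for illustration.

```python
import math
import random


def entropy_nats(p):
    """Entropy in nats of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def lhs(p, rho):
    """Left-hand side of (A.1): (1 + rho) * ln sum_u P(u)^(1/(1+rho))."""
    return (1 + rho) * math.log(sum(x ** (1 / (1 + rho)) for x in p))


def objective(q, p, rho):
    """Right-hand side objective rho*H(P') - D(P'||P) at P' = q (p strictly positive)."""
    d = sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)
    return rho * entropy_nats(q) - d


def tilted(p, rho):
    """Assumed maximizer: P_rho(u) proportional to P(u)^(1/(1+rho))."""
    w = [x ** (1 / (1 + rho)) for x in p]
    z = sum(w)
    return [x / z for x in w]


if __name__ == "__main__":
    p, rho = [0.7, 0.2, 0.1], 0.5
    print(lhs(p, rho), objective(tilted(p, rho), p, rho))
    print(entropy_nats(tilted(p, rho)), entropy_nats(p))
```

The second printed line compares the entropy of the tilted distribution with that of $P$, in line with the observation that the maximizer has the larger entropy for $\rho>0$.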